Weighted Random Samplers in PyTorch - computer-vision

I am new to samplers and don't understand why we should use a weighted random sampler. Can anyone explain it to me? Also, should we use a weighted random sampler for the validation set?

This is very much a PyTorch-independent question and, as such, might seem a bit off-topic.
Take a classification task: your dataset may contain more instances of one class than the others, making that class overrepresented. This can cause problems. During training, your model is shown more instances of that class than of the others, and in that sense it can become biased towards the prominent class.
To counter that, you can use a weighted sampler that effectively levels out the unequal numbers of instances, so that, on average over one epoch, the model sees about as many examples of each class. This gives you learning that is balanced across classes, independently of the fact that you may have different numbers of instances per class.
To answer your second question: no, I don't think you should use a weighted sampler on your validation set. There is no need to adopt a specific sampling policy there. The point of validation is to measure the performance of your fixed model on unseen data, just as with the test set, where you won't have access to the class statistics needed to build a weighted sampler anyway.
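For concreteness, here is a minimal sketch of how such a sampler is typically wired up with torch.utils.data.WeightedRandomSampler; the toy tensors and the inverse-frequency weighting scheme are illustrative assumptions, not the only way to do it.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy imbalanced training set: 90 samples of class 0, 10 samples of class 1.
train_features = torch.randn(100, 8)
train_labels = torch.cat([torch.zeros(90, dtype=torch.long),
                          torch.ones(10, dtype=torch.long)])
train_set = TensorDataset(train_features, train_labels)

# Weight each sample by the inverse frequency of its class.
class_counts = torch.bincount(train_labels)                  # tensor([90, 10])
sample_weights = 1.0 / class_counts[train_labels].float()    # one weight per sample

sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)

# Pass the sampler instead of shuffle=True; on average, batches now contain
# the two classes in roughly equal proportion.
train_loader = DataLoader(train_set, batch_size=16, sampler=sampler)

# Validation keeps the natural class distribution, so no sampler is used there.
val_loader = DataLoader(TensorDataset(torch.randn(20, 8),
                                      torch.randint(0, 2, (20,))), batch_size=16)
```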

Related

Best way to feature select using PCA (discussion)

Terminology:
Component: PC
loading-score[i,j]: the loading of feature j in PC[i]
Question:
I know that questions about feature selection have been asked several times here on StackOverflow (SO) and on other tech pages, with different answers and discussions proposed. That is why I want to open a discussion about the different solutions rather than post it as a general question, since that has already been done.
Different methods have been proposed for feature selection using PCA. For instance, one can use the dot product between the original features and the components (here) to get their correlation. A discussion at SO here suggests that you can only talk about important features as loading scores within a component (and not use that importance in the input space). Another discussion at SO (which I cannot find at the moment) suggests that the importance of feature[j] would be abs(sum(loading_score[:,j])), i.e. the sum of the absolute values of loading_score[i,j] over all components i.
I personally would think that one way to get the importance of a feature would be an absolute sum in which each loading_score[i,j] is weighted by the explained variance of component i, i.e.
imp_feature[j] = sum_i( abs(loading_score[i,j]) * explained_variance[i] )
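For concreteness, here is a rough sketch of what I mean in scikit-learn terms; the iris data and the use of explained_variance_ratio_ rather than the raw variances are just my assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)        # PCA is sensitive to feature scale

pca = PCA().fit(X)
loading_scores = pca.components_             # shape (n_components, n_features)
explained_variance = pca.explained_variance_ratio_

# imp_feature[j] = sum_i( abs(loading_score[i,j]) * explained_variance[i] )
imp_feature = np.abs(loading_scores).T @ explained_variance

for j, imp in enumerate(imp_feature):
    print(f"feature {j}: importance {imp:.3f}")
```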
Well, there is no universal way to select features; it depends entirely on the dataset and on whatever insight you have about it. I will give some examples which might be helpful.
Since you asked about PCA: it decomposes the dataset into components ordered by how much of the total variance each one explains. ICA (Independent Component Analysis), on the other hand, extracts multiple statistically independent components simultaneously. Look at this example:
In that example, three independent signals are mixed and then separated again using ICA and PCA, and ICA does a better job than PCA in this case. If you search for Blind Source Separation (BSS), you will find more information about this. Note that in the example the independent components are known, so separation is easy; in general we do not know the number of components, so you may have to guess it from prior information about the dataset. You may also use LDA (Linear Discriminant Analysis) to reduce the number of features.
Once you have extracted components with any of these techniques, you can visualize them by treating the extracted components as random variables, e.g. x, y, z.
For more information you may refer to the original source from which I took the two figures.
Coming back to your proposition,
imp_feature[j] = sum_i( abs(loading_score[i,j]) * explained_variance[i] )
I would not recommend this way due to the following reasons:
Taking abs(loading_score[i,j]) discards the sign, so you lose whether a feature is positively or negatively correlated with the component. explained_variance[i] may be used to find the correlation between features, but multiplying the loadings by it does not make much sense.
Edit:
In PCA, each component has its explained variance. Explained variance is the ratio between an individual component's variance and the total variance (the sum of all the individual components' variances). The significance of a component can be measured by the magnitude of its explained variance.
All in all, what I want to say is that feature selection depends entirely on the dataset and on the significance of the features; PCA is just one technique. First understand the properties of the features and the dataset, then try to extract features. Hope this helps. If you can provide us with a concrete example, we may be able to give more specific insights.

Genetic Algorithm: Grouping students without knowing exact number of groups

I have students with defined levels. Some of the students are already in groups from the previous week, and some of them are new. Students from the previous week should be kept in their groups.
A group has a level, calculated as the average of its students' levels. A new student can be added to a group if the difference between the student's level and the group's level is less than a defined limit (for example 3). There are also minimum and maximum group sizes. If there is not enough space in a group, we should create a new one.
I have tried to solve this with clustering algorithms (hierarchical and non-hierarchical), but none of them works for my case.
I need to create the minimum number of groups.
I would like to know whether a genetic algorithm will work. Each gene of a chromosome would represent a single student and their assignment to a group. The fitness function would use all the constraints (maximum group size, minimum group size).
As I understand it, to apply a genetic algorithm I need to know the number of groups, which is not known in my case. Any ideas?
Yes, a genetic algorithm can work. I'm not sure where you got the idea that you have to know the number of groups. All a genetic algorithm needs is a generator for making children, a fitness function to judge which children are best, and a few quantity parameters (how many to keep as parents for the next generation, how many children to produce, ... things that are in the generator).
I suggest that your individuals ("chromosomes") be a list of the groups for the new generation. To save time, your generator should yield only viable children: those that fulfill the group-size requirements. Any child that does not satisfy those should be skipped and replaced.
The main work in this scenario is setting up a generator that knows how to split groups: when you find that a new student requires a new group, then you have to draw min_group_size-1 students from other groups. If you have the entire population of new students at once, then you can make global decisions, instead.
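To make that encoding concrete, here is a rough, hedged sketch of one possible fitness function; the size limits, level limit, and penalty weights are placeholders I made up, and it does not yet distinguish returning students from new ones.

```python
# A candidate ("chromosome") is a list of groups; each group is a list of student levels.
MIN_SIZE, MAX_SIZE, LEVEL_LIMIT = 3, 6, 3.0   # illustrative constants, not recommendations

def group_level(group):
    """A group's level is the average of its students' levels."""
    return sum(group) / len(group)

def fitness(candidate):
    """Lower is better: few groups, no size violations, no level violations."""
    penalty = 0.0
    for group in candidate:
        # group-size constraint
        if not (MIN_SIZE <= len(group) <= MAX_SIZE):
            penalty += 100.0
        # every student must be within LEVEL_LIMIT of the group's level
        level = group_level(group)
        penalty += sum(100.0 for s in group if abs(s - level) > LEVEL_LIMIT)
    # primary objective: minimise the number of groups
    return len(candidate) + penalty

# Two candidate groupings of the same six students (levels only):
students = [1, 2, 2, 3, 8, 9]
print(fitness([[1, 2, 2, 3], [8, 9]]))   # second group is undersized -> penalised
print(fitness([[1, 2, 2], [3, 8, 9]]))   # level spread too wide in second group -> penalised
```

The generator then only has to propose candidates; the fitness function (together with viability checks inside the generator) does the ranking.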
Is this enough to move you in a useful direction?
Update per user comment:
You cannot guarantee finding the optimal answer with a genetic algorithm.
The number of chromosomes depends on what works best for you. You need to handle a variety of possible group assignments, as well as new groups. Here is where you have to experiment; welcome to machine learning.
I would start investigating with a "comfortable" number of chromosomes, perhaps the quantity of groups times sqrt(quantity of new students). Depending on time constraints, I'd think that somewhere from 20 to 200 chromosomes would be good for you. Your critical measures of success are how often it finds a great solution, and how much time you spend finding it.
Yes, forming student groups can be done with the help of optimization. The Genetic Algorithm (GA) is not the only optimization algorithm that has been applied to this specific problem; Particle Swarm Optimization (PSO) has been used as well. In recent research, a PSO was implemented to group students when the number of groups is unknown, and it showed improved capabilities compared to a GA. I think that this specific research is all you need.
The paper is: Forming automatic groups of learners using particle swarm optimization for applications of differentiated instruction
You can find the paper here: https://doi.org/10.1002/cae.22191
Perhaps the researchers could guide you through researchgate:
https://www.researchgate.net/publication/338078753
As far as I can see, the researchers used the characteristics of each cluster as the solution vector (the chromosome), combined with a threshold number that determines the number of groups (very interesting - I think this is exactly what you need).
I hope I have helped you.

Influence diagrams / Decision models in Stan and PyMC3

Is it possible to write decision-making models in either Stan or PyMC3? By that I mean: we define not only the distribution of random variables, but also the definition of decision and utility variables, and determine the decisions maximizing expected utility.
My understanding is that Stan is more of a general optimizer than PyMC3, so that suggests decision models would be more directly implemented in it, but I would like to hear what people have to say.
Edit: While it is possible to enumerate all decisions and compute their corresponding expected utility, I am wondering about more efficient methods, since the number of decisions could be combinatorially large (for example, how many items to buy from a list of thousands of products). Influence diagram algorithms exploit factorizations in the model to identify independences, which allows the decisions to be computed over only a smaller set of relevant random variables. I wonder whether either Stan or PyMC3 does that kind of thing.
The basic steps for Bayesian decision theory are:
1. Enumerate a finite set of decisions that could be made.
2. Specify a utility function of the decision and perhaps other things.
3. Draw from the posterior distribution of all the unknowns given the known data.
4. Evaluate the utility function for each possible decision and each posterior draw.
5. Make the decision with the highest expected utility, averaging over the posterior draws.
You can do those five steps with any software --- Stan and PyMC3 included --- that produces (valid) draws from the posterior distribution. In Stan, the utility function should be evaluated in the generated quantities block.
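As a rough sketch of those five steps, here is a toy example using the PyMC3 (v3) API and NumPy; the newsvendor-style stocking problem, the Poisson demand model, the prices, and the priors are all assumptions I made up for illustration.

```python
import numpy as np
import pymc3 as pm

observed_demand = [7, 9, 6, 8, 10, 7]      # historical daily demand (toy data)
decisions = np.arange(0, 21)               # step 1: candidate stock levels

price, cost = 5.0, 2.0                     # step 2: utility = revenue - purchase cost
def utility(stock, demand):
    return price * np.minimum(stock, demand) - cost * stock

with pm.Model():                           # step 3: posterior over the unknown demand rate
    rate = pm.Gamma("rate", alpha=2.0, beta=0.2)
    pm.Poisson("obs", mu=rate, observed=observed_demand)
    trace = pm.sample(2000, tune=1000, return_inferencedata=False)

# step 4: simulate a demand for each posterior draw and evaluate every decision on every draw
rng = np.random.default_rng(0)
demand_draws = rng.poisson(trace["rate"])
expected_utility = [utility(d, demand_draws).mean() for d in decisions]

# step 5: pick the decision with the highest expected utility
best = decisions[int(np.argmax(expected_utility))]
print(f"stock {best} units (expected utility {max(expected_utility):.2f})")
```

Note that the loop over decisions is exactly the brute-force enumeration mentioned in the question; this sketch does not attempt the influence-diagram-style factorization asked about in the edit.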

Is there a way to quantify impact of independent variables with gradient boosting?

I've been asked to run a model using gradient boosting or random forest. So far so good, however, the only output that comes back in terms of variable importance is based on the number of times a variable was used as a branch rule. I've now been asked to basically get coefficients or somehow quantify the impact that the variables have on the target.
Is there a way to do this with a gradient boosting model? My other thought was to use only the variables that were shown to be used as branch rules, either in a regular decision tree or in a GLM or regular regression model.
Any help or ideas would be appreciated!! Thanks so much!
Just to make certain there is no misunderstanding: the SAS implementation of decision trees/gradient boosting (at least in EM) uses split-based variable importance.
Split-based importance does NOT simply count the number of splits made.
It is the ratio of the reduction in sum-of-squares achieved by one variable (specifically, summed over all splits on that variable) to the reduction in sum-of-squares achieved by all splits in the model.
If you are using surrogate rules, highly correlated variables will receive roughly the same value.
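If you also have the option of fitting the model outside SAS, here is a hedged scikit-learn sketch on a toy dataset showing two complementary ways to quantify impact: the built-in impurity/split-based importances and permutation importance, which measures how much the held-out score drops when one feature is shuffled. The dataset and settings are placeholders.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# split/impurity-based importance: relative contribution to loss reduction, sums to 1
print(dict(zip(X.columns, gbm.feature_importances_.round(3))))

# permutation importance: drop in held-out R^2 when a feature is randomly shuffled
perm = permutation_importance(gbm, X_test, y_test, n_repeats=10, random_state=0)
print(dict(zip(X.columns, perm.importances_mean.round(3))))
```

If you specifically need something coefficient-like (direction and magnitude of the effect on predictions), partial dependence plots or SHAP values are the usual next step for tree ensembles.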

Which Data Mining task to retrieve a unique instance

I'm working with data mining, and I'm familiar with classification, clustering and regression tasks. In classification, one can have a lot of instances (e.g. animals), their features (e.g. number of legs) and a class (e.g. mammal, reptile).
But what I need to accomplish is, given some attributes, including the class attribute, to determine which unique instance I'm referring to (e.g. giraffe). I can supply all known attributes that I have, and if the model can’t figure out the answer, it can ask for another attribute – just analogous to a 20 questions style of game.
So, my question is: does this specific task have a name? It seems similar to classification where the class is unique to each instance, but that wouldn't fit the usual training models, except perhaps a decision tree model.
Your inputs, called features in machine learning, are tuples of species (what I think you mean by "instance") and physical attributes. Your outputs are broader taxonomic ranks. Thus, assigning one to each input is a classification problem. Since your features are incomplete, you want to perform ... classification with incomplete data, or impute the missing features. Searching for these terms will give you enough leads.
(And the other task is properly called clustering.)
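If you go the imputation route, here is a small hedged sketch with scikit-learn; the tiny animal table, the feature choices, and the most-frequent imputation strategy are illustrative assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# features: [number of legs, lays eggs (0/1)]; NaN marks an unanswered attribute
X_train = np.array([[4, 0], [4, 0], [2, 1], [0, 1]], dtype=float)
y_train = ["mammal", "mammal", "bird", "reptile"]

clf = make_pipeline(SimpleImputer(strategy="most_frequent"),
                    DecisionTreeClassifier(random_state=0))
clf.fit(X_train, y_train)

# classify an incomplete query: number of legs unknown, lays eggs = 1
print(clf.predict([[np.nan, 1.0]]))
```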
IMHO you are simply looking for a decision tree.
Except that you don't train it on your categorical attribute (your "class"), but on the individual instance label.
You need to choose the splitting measure carefully though, as many measures take class sizes into account - and all your classes now have size 1. Finding a good split may involve planning a few splits ahead to get an optimally balanced tree. A random-forest-like approach may help improve the chance of finding a good tree.
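As a rough sketch of that idea (the tiny animal table and feature names are made up for illustration), each row below is one unique instance labelled with its own name, and the fitted tree's questions can be read off directly, which is essentially the "20 questions" script.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["legs", "neck_length_m", "lays_eggs"]
X = np.array([[4, 2.0, 0],    # giraffe
              [4, 0.3, 0],    # dog
              [2, 0.1, 1],    # sparrow
              [0, 0.0, 1]])   # snake
labels = ["giraffe", "dog", "sparrow", "snake"]   # one label per individual instance

tree = DecisionTreeClassifier(random_state=0).fit(X, labels)

# each internal node of the fitted tree is one "question" about an attribute
print(export_text(tree, feature_names=feature_names))
print(tree.predict([[4, 2.0, 0]]))   # -> ['giraffe']
```

With every class of size 1, the default splitting criterion will still separate the instances on toy data like this, but, as said above, the choice of splitting measure and tree balance deserve extra care on real data.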