Genetic Algorithm: Grouping students without knowing exact number of groups

I have students with defined levels. Some of the students are already in groups from the previous week, and some of them are new. Students from the previous week should be kept in their groups.
A group has a level, which is calculated as the average of its students' levels. A new student can be added to a group if the difference between the student's level and the group's level is less than a defined limit (for example, 3). There are also minimum and maximum group sizes. If there is not enough space in a group, we should create a new one.
I have tried to solve this with clustering algorithms (hierarchical and non-hierarchical), but none of them works for my case.
I need to create the minimum number of groups.
I would like to know whether a genetic algorithm will work. The genes of a chromosome would represent a single student and their assignment to a group. The fitness function would use all the constraints (max group size, min group size).
As I understand it, to apply a genetic algorithm I need to know the number of groups, which is not known in my case. Any ideas?

Yes, a genetic algorithm can work. I'm not sure where you got the idea that you have to know the number of groups. All a genetic algorithm needs is a generator for making children, a fitness function to judge which children are the best, and a few quantity parameters (how many to keep as parents for the next generation, how many children to produce, ... things that are in the generator).
I suggest that your individuals ("chromosomes") be a list of the groups for the new generation. To save time, your generator should yield only viable children: those that fulfill the group-size requirements. Any child that does not satisfy those should be skipped and replaced.
The main work in this scenario is setting up a generator that knows how to split groups: when you find that a new student requires a new group, then you have to draw min_group_size-1 students from other groups. If you have the entire population of new students at once, then you can make global decisions, instead.
Is this enough to move you in a useful direction?
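To make the representation concrete, here is a minimal sketch of one possible encoding and fitness function, assuming a hypothetical Student struct with a level field and the constraints named in the question (min_size, max_size, level_limit). It is only an illustration of the idea, not the only workable design; the fitness rewards feasibility and penalizes the group count, since you want as few groups as possible.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Hypothetical sketch: a chromosome is a partition of student indices into
    // groups; fitness penalizes size violations, level violations and the total
    // number of groups (fewer groups is better).
    struct Student { double level; };

    using Chromosome = std::vector<std::vector<int>>;  // each inner vector is one group

    // Average level of group g (groups are assumed to be non-empty).
    double group_level(const Chromosome& c, const std::vector<Student>& s, int g) {
        double sum = 0.0;
        for (int i : c[g]) sum += s[i].level;
        return sum / c[g].size();
    }

    // Higher is better. min_size, max_size and level_limit are the constraints
    // from the question.
    double fitness(const Chromosome& c, const std::vector<Student>& s,
                   int min_size, int max_size, double level_limit) {
        double penalty = 0.0;
        for (std::size_t g = 0; g < c.size(); ++g) {
            int n = (int)c[g].size();
            if (n < min_size) penalty += min_size - n;          // group too small
            if (n > max_size) penalty += n - max_size;          // group too large
            double lvl = group_level(c, s, (int)g);
            for (int i : c[g])
                if (std::fabs(s[i].level - lvl) > level_limit)  // level constraint
                    penalty += 1.0;
        }
        // Weight the penalties so infeasible children always lose to feasible ones.
        return -(double)c.size() - 100.0 * penalty;
    }

A generator as described above would then produce children by moving a few students between groups (or opening a new group for a student who fits nowhere) and keep only the children that score well here.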
Update per user comment:
You cannot guarantee finding the optimal answer with a genetic algorithm.
The number of chromosomes depends on what works best for you. You need to handle a variety of possible group assignments, as well as new groups. Here is where you have to experiment; welcome to machine learning.
I would start investigating with a "comfortable" number of chromosomes, perhaps the quantity of groups times sqrt(quantity of new students). Depending on time constraints, I'd think that somewhere from 20 to 200 chromosomes would be good for you. Your critical measures of success are how often it finds a great solution, and how much time you spend finding it.

Yes, forming student groups can be done with the help of optimization. The Genetic Algorithm (GA) is not the only optimization algorithm that has been applied to this specific problem; Particle Swarm Optimization (PSO) has been used as well. In a recent study, a PSO was implemented to group students when the number of groups is unknown, and it showed improved capabilities compared to a GA. I think that specific piece of research is all you need.
The paper is: Forming automatic groups of learners using particle swarm optimization for applications of differentiated instruction
You can find the paper here: https://doi.org/10.1002/cae.22191
Perhaps the researchers could guide you via ResearchGate:
https://www.researchgate.net/publication/338078753
As far as I can see, the researchers used the characteristics of each cluster as a solution vector (the chromosomes), combined with a threshold number that determines the number of groups (very interesting, and I think this is exactly what you need).
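For what it is worth, here is a rough sketch of how such an encoding could look. This is my own reading of the idea (candidate group centres plus a threshold that decides how many of them are active), not the paper's actual formulation, so treat the names and structure as illustrative only.

    #include <cstddef>
    #include <vector>

    // Illustration only: a particle carries K candidate group levels plus a
    // threshold; only candidates whose activation reaches the threshold are
    // used, so the number of groups is decided by the particle itself.
    struct Particle {
        std::vector<double> center;      // K candidate group levels
        std::vector<double> activation;  // one activation value per candidate
        double threshold;                // candidates with activation >= threshold are active
    };

    std::vector<double> active_centers(const Particle& p) {
        std::vector<double> result;
        for (std::size_t k = 0; k < p.center.size(); ++k)
            if (p.activation[k] >= p.threshold)
                result.push_back(p.center[k]);
        return result;                   // the group count emerges from the threshold
    }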
I hope I have helped you

Related

How to implement TSP with dynamic programming in C++

Recently I asked a question on Stack Overflow for help with a problem. It is a travelling salesman problem where I have up to 40,000 cities, but I only need to visit 15 of them.
I was pointed to use Dijkstra with a priority queue to make a connectivity matrix for the 15 cities I need to visit and then do TSP on that matrix with DP. I had previously only used the O(n^2) version of Dijkstra. After trying to figure out how to implement the priority-queue version, I finally did it (enough to go from 240 seconds to 0.6 for 40,000 cities). But now I am stuck on the TSP part.
Here are the materials I used for learning TSP:
Quora
GeeksForGeeks
I sort of understand the algorithm (but not completely), but I am having trouble implementing it. Before this I have only done dynamic programming with arrays that would be dp[int] or dp[int][int]. But now that my dp matrix has to be dp[subset][int], I don't have any idea how I should do this.
My questions are:
How do I handle the subsets with dynamic programming? (an example in C++ would be appreciated)
Do the algorithms I linked to allow visiting cities more than once, and if they don't what should I change?
Should I perhaps use another TSP algorithm instead? (I noticed there are several ways to do it.) Keep in mind that I must get the exact value, not an approximation.
Edit:
After some more research I stumbled across some competitive programming contest lectures from Stanford and managed to find TSP here (slides 26-30). The key is to represent the subset as a bitmask. This still leaves my other questions unanswered though.
Can any changes be made to that algorithm to allow visiting a city more than once? If so, what are those changes? Otherwise, what should I try?
I think you can use the dynamic programming solution and add, for each pair of nodes, a second edge holding the shortest path between them. See also this question: Variation of TSP which visits multiple cities.
Here is a TSP implementation; you will find the link to the implemented problem in the post.
The algorithms you linked don't allow visiting cities more than once.
For your third question, I think Phpdna's answer was good.
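Since the question asks for an example of the dp[subset][int] table in C++, here is a minimal Held-Karp sketch over the reduced complete graph of the 15 relevant cities. The dist matrix is assumed to hold the shortest paths computed in the Dijkstra step; the names are illustrative.

    #include <algorithm>
    #include <vector>

    // Minimal Held-Karp sketch. dist[i][j] is the shortest path between the
    // relevant cities i and j, computed in the Dijkstra preprocessing step.
    long long tsp(const std::vector<std::vector<long long>>& dist) {
        const int n = (int)dist.size();                 // here n == 15
        const long long INF = (long long)1e18;
        // dp[mask][i] = cheapest cost to start at city 0, visit exactly the
        // cities in 'mask' (bit i set means city i visited), and end at city i.
        std::vector<std::vector<long long>> dp(1 << n, std::vector<long long>(n, INF));
        dp[1][0] = 0;                                   // start at city 0
        for (int mask = 1; mask < (1 << n); ++mask) {
            for (int last = 0; last < n; ++last) {
                if (!(mask & (1 << last)) || dp[mask][last] == INF) continue;
                for (int next = 0; next < n; ++next) {
                    if (mask & (1 << next)) continue;   // already visited
                    int nmask = mask | (1 << next);
                    dp[nmask][next] = std::min(dp[nmask][next],
                                               dp[mask][last] + dist[last][next]);
                }
            }
        }
        long long best = INF;
        for (int last = 1; last < n; ++last)            // close the tour back to city 0
            best = std::min(best, dp[(1 << n) - 1][last] + dist[last][0]);
        return best;
    }

The full bitmask (1 << n) - 1 means every relevant city has been visited; the table has 2^15 * 15 entries, which is only a few megabytes.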
Can cities be visited more than once? Yes and no. In your first step, you reduce the problem to the 15 relevant cities. This results in a complete graph, i.e. one where every node is connected to every other node. The connection between two such nodes might involve multiple cities on the original map, including some of the relevant ones, but that shouldn't be relevant to your algorithm in the second step.
As for whether to use a different algorithm: I would perhaps do a depth-first search through the graph. Using a minimum spanning tree, you can give an upper and a lower bound for the remaining cities, and use that to pick promising solutions and to discard hopeless ones (a.k.a. pruning). There has also been a bunch of research done on this topic; just search the web. For example, in cases where the map is actually Cartesian (i.e. the travelling costs are the distances between points on a plane), you can exploit this to improve the algorithms a bit.
Lastly, if you really intend to increase the number of visited cities, you will find that the time for computing it increases vastly, so you will have to abandon your requirement for an exact solution.

Which Data Mining task to retrieve a unique instance

I'm working with data mining, and I'm familiar with classification, clustering and regression tasks. In classification, one can have a lot of instances (e.g. animals), their features (e.g. number of legs) and a class (e.g. mammal, reptile).
But what I need to accomplish is, given some attributes, including the class attribute, to determine which unique instance I'm referring to (e.g. giraffe). I can supply all known attributes that I have, and if the model can't figure out the answer, it can ask for another attribute, analogous to a game of 20 questions.
So, my question is: does this specific task have a name? It seems similar to classification, where the class is unique to each instance, but this wouldn't fit the usual training models, except perhaps for a decision tree model.
Your inputs, called features in machine learning, are tuples of species (what I think you mean by "instance") and physical attributes. Your outputs are broader taxonomic ranks. Thus, assigning one to each input is a classification problem. Since your features are incomplete, you want to perform classification with incomplete data, or impute the missing features. Searching for these terms will give you enough leads.
(And the other task is properly called clustering.)
IMHO you are simply looking for a decision tree.
Except that you don't train it on your categorical attribute (your "class"), but on the individual instance label.
You need to choose the splitting measure carefully though, as many measures go for class sizes, and all your classes have size 1 now. Finding a good split for the decision tree may involve planning some splits ahead to get an optimally balanced tree. A random-forest-like approach may be of use to improve the chance of finding a good tree.
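To illustrate the 20-questions flavour of such a tree, here is a toy sketch (hypothetical names, boolean attributes only) that picks the attribute whose answer splits the remaining candidate instances most evenly; with one "class" per instance that is also the split with the highest information gain.

    #include <vector>

    // candidates[i][a] == true means remaining candidate instance i has attribute a.
    // Returns the attribute whose yes/no answer splits the candidates most evenly.
    int best_attribute(const std::vector<std::vector<bool>>& candidates) {
        if (candidates.empty()) return -1;
        int num_attr = (int)candidates[0].size();
        int best = -1, best_imbalance = (int)candidates.size() + 1;
        for (int a = 0; a < num_attr; ++a) {
            int yes = 0;
            for (const auto& inst : candidates)
                if (inst[a]) ++yes;
            int no = (int)candidates.size() - yes;
            int imbalance = yes > no ? yes - no : no - yes;  // 0 = perfectly even split
            if (imbalance < best_imbalance) { best_imbalance = imbalance; best = a; }
        }
        return best;  // ask about this attribute next and recurse on the matching subset
    }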

Cutting dendrogram at highest level of purity

I am trying to create a program that clusters documents using hierarchical agglomerative clustering, and the output of the program depends on cutting the dendrogram at the level that gives maximum purity.
So the following is the algorithm I am working on right now.
    create dendrogram for the documents in the dataset
    purity = 0
    final_clusters = none
    for each level lvl in the dendrogram:
        clusters = cut dendrogram at lvl
        new_purity = calculate_purity_of(clusters)
        if new_purity > purity:
            purity = new_purity
            final_clusters = clusters
According to this algorithm, I get the clusters for which the calculated purity is highest over all levels.
The problem is that when I cut the dendrogram at the lowest level, every cluster contains only one document, which means it is 100% pure, so the average purity of the clusters is 1.0. But this is not the desired output. What I want is a proper grouping of the documents. Am I doing something wrong?
You are using too simple a measure.
Yes, the "optimal" solution with respect to purity is to only merge duplicate objects, so that each cluster remains pure by definition.
This is why optimizing a mathematical criterion often isn't the right approach to tackle a real data problem. Instead, you need to ask yourself the question: "what would be an interesting result", where interesting is not the same as optimal in a mathematical sense.
Sorry that I cannot give you a better answer - but I don't have your data.
IMHO, any abstract mathematical approach will suffer from the same fate. You need to have your data and user needs specify what to cluster, not some statistical number; so don't look in mathematics for the answer, but look at your data and your user needs.
I know it's been a few years, but one potential way you can improve your results is to add a penalty component that increases with the number of clusters. That way, your "optimal setting" doesn't take the shortcut and instead gives you a more balanced solution.
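A minimal sketch of that penalty idea, with a hypothetical lambda weight you would have to tune for your data:

    #include <cstddef>

    // Score a cut by its purity minus a term that grows with the number of
    // clusters; purity alone is maximized by one-document clusters, and the
    // penalty removes that shortcut. lambda is a tuning knob.
    double penalized_score(double purity, std::size_t num_clusters,
                           std::size_t num_documents, double lambda) {
        return purity - lambda * (double)num_clusters / (double)num_documents;
    }

You would then keep the cut with the highest penalized score instead of the highest raw purity.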

Which Evolutionary Algorithm for optimization of binary problems?

In our program we have used a genetic algorithm for years to solve problems with n variables, each having a fixed set of m possible values. This typically works well for ~1,000 variables and 10 possibilities each.
Now I have a new task where only two possibilities (on/off) exist for each variable, but I'll probably need to solve systems with 10,000 or more variables. The existing GA does work, but the solution improves only very slowly.
All the EAs I find are designed for continuous or integer/float problems instead. Which one is best suited for binary problems?
Well, the Genetic Algorithm in its canonical form is among the best suited metaheuristics for binary decision problems. The default configuration that I would try is a genetic algorithm that uses 1-elitism and is configured with roulette-wheel selection, single-point crossover (100% crossover rate) and bit-flip mutation (e.g. 5% mutation probability). I would suggest you try this combination with a modest population size (100-200). If this does not work well, I would suggest increasing the population size, but also changing the selection scheme to tournament selection (start with binary tournament selection and increase the tournament group size if you need even more selection pressure). The reason is that with a higher population size, the fitness-proportional selection scheme might not exert the necessary amount of selection pressure to drive the search towards the optimal region.
As an alternative, we have developed an advanced version of the GA and termed it the Offspring Selection Genetic Algorithm. You can also consider solving this problem with a trajectory-based algorithm like Tabu Search or Simulated Annealing, which uses only mutation to move from one solution to another by making small changes.
We have GUI-driven software (HeuristicLab) that allows you to experiment with a number of metaheuristics on several problems. Your problem is unfortunately not included, but it's GPL licensed and you can implement your own problem there (even through just the GUI; there's a how-to for that).
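To make the suggested configuration concrete, here is a small sketch of the two variation operators for a plain bitstring encoding (single-point crossover and per-bit flip mutation); the 5% flip rate above is a starting point, and for very long genomes a rate around 1/n per bit is a common alternative worth trying.

    #include <cstddef>
    #include <random>
    #include <vector>

    std::mt19937 rng(42);  // fixed seed so runs are reproducible while experimenting

    // Single-point crossover of two parent bitstrings (assumed equal length >= 2).
    std::vector<bool> crossover(const std::vector<bool>& a, const std::vector<bool>& b) {
        std::uniform_int_distribution<std::size_t> cut(1, a.size() - 1);
        std::size_t point = cut(rng);                    // single crossover point
        std::vector<bool> child(a.begin(), a.begin() + point);
        child.insert(child.end(), b.begin() + point, b.end());
        return child;
    }

    // Bit-flip mutation: each bit is flipped independently with probability flip_prob.
    void mutate(std::vector<bool>& genome, double flip_prob = 0.05) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        for (std::size_t i = 0; i < genome.size(); ++i)
            if (u(rng) < flip_prob) genome[i] = !genome[i];
    }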
Like DonAndre said, canonical GA was pretty much designed for binary problems.
However...
No evolutionary algorithm is in itself a magic bullet (unless it has billions of years runtime). What matters most is your representation, and how that interacts with your mutation and crossover operators: together, these define the 'intelligence' of what is essentially a heuristic search in disguise. The aim is for each operator to have a fair chance of producing offspring with similar fitness to the parents, so if you have domain-specific knowledge that allows you to do better than randomly flipping bits or splicing bitstrings, then use this.
Roulette and tournament selection and elitism are good ideas (maybe preserving more than 1, it's a black art, who can say...). You may also benefit from adaptive mutation. The old rule of thumb is that 1/5 of offspring should be better than the parents - keep track of this quantity and vary the mutation rate appropriately. If offspring are coming out worse then mutate less; if offspring are consistently better then mutate more. But the mutation rate needs an inertia component so it doesn't adapt too rapidly, and as with any metaparameter, setting this is something of a black art. Good luck!
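A tiny sketch of that 1/5 rule, with a damping factor so the rate does not swing too fast (the parameter values are just illustrative):

    // Track how often children beat their parents over the last generation(s)
    // and nudge the mutation rate toward the 1/5 target.
    double adapt_mutation_rate(double rate, double success_ratio,
                               double target = 0.2, double damping = 0.85) {
        if (success_ratio > target)
            return rate / damping;  // children often better: mutate more
        else
            return rate * damping;  // children mostly worse: mutate less
    }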
Why not try a linear/integer program?

Appropriate minimum support for itemsets?

Please suggest any kind of material about choosing an appropriate minimum support and confidence for itemsets!
I use the Apriori algorithm to search for frequent itemsets. I still don't know the appropriate support and confidence for the itemsets. I wish to know what kinds of considerations go into deciding how big the support should be.
The answer is that the appropriate values depend on the data.
For some datasets, the best value may be 0.5, but for other datasets it may be 0.05. It depends on the data.
But if you set minsup = 0 and minconf = 0, some algorithms will run out of memory before terminating, or you may run out of disk space because there are too many patterns.
From my experience, the best way to choose minsup and minconf is to start with a high value and then lower them gradually until you find enough patterns.
Alternatively, if you don't want to have to set minsup, you can use a top-k algorithm where, instead of specifying minsup, you specify for example that you want the k most frequent rules, e.g. k = 1000 rules.
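A rough sketch of the "start high and lower gradually" strategy; mine is a hypothetical stand-in for whatever Apriori call you actually use, and the step and floor values are arbitrary.

    #include <cstddef>

    // 'mine' takes a minsup value and returns the patterns found at that
    // threshold (hypothetical callable, e.g. a wrapper around your Apriori code).
    template <typename Miner>
    auto mine_enough_patterns(Miner mine, std::size_t wanted,
                              double minsup = 0.5, double step = 0.75,
                              double lowest = 0.001) {
        auto result = mine(minsup);
        while (result.size() < wanted && minsup > lowest) {
            minsup *= step;              // lower the threshold gradually
            result = mine(minsup);
        }
        return result;                   // first threshold that yields enough patterns
    }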
If you are interested in top-k association rule mining, you can check my Java code here:
http://www.philippe-fournier-viger.com/spmf/
The algorithm is called TopKRules and the article describing it will be published next month.
Besides that, you need to know that there are many other interestingness measures besides support and confidence: lift, all-confidence, ... To learn more about this, you can read the articles "On selecting interestingness measures for association rules" and "A Survey of Interestingness Measures for Association Rules". Basically, all measures have problems in some cases; no measure is perfect.
Hope this helps!
In any association rule mining algorithm, including Apriori, it is up to the user to decide what support and confidence values they want to provide. Depending on your dataset and your objectives you decide the minSup and minConf.
Obviously, if you set these values lower, then your algorithm will take longer to execute and you will get a lot of results.
The minimum support and minimum confidence parameters are a user preference. If you want a larger quantity of results (with lower statistical confidence), choose the parameters appropriately. In theory you can set them to 0. The algorithm will run, but it will take a long time, and the result will not be particularly useful, as it contains just about anything.
So choose them so that the results suit your needs. Mathematically, any value is "correct".