How to write a MILP equation for this problem? - linear-programming

Consider the classic network flow problem where the constraint is that the outflow from a vertex equals the sum of the inflows into it. Now consider a more specific setting in which the flow can be split between edges.
I have two questions:
How can I use a decision variable to identify that node j is receiving items from multiple edges?
How to create another equation to determine the cost (2 units of time per item) of joining x items from different edges in the sink node?

This is a tricky modeling question. Let's take it in parts.
Consider having a more specific constraint where the flow can be split between edges
Here I assume that you have the classic flow constraint modeled with a set of continuous variables y_ij. Therefore, the flow can be split across two or more arcs.
How can I use a decision variable to identify that node j is receiving items from multiple edges?
You need to create an additional binary variable z_ij that equals 1 when arc (i, j) carries flow. You must also create the following linking constraint: y_ij <= M * z_ij, where M is a sufficiently large constant (for example, an upper bound on the flow of arc (i, j)).
Next, you will need another additional integer variable set, let's say p_j, and an additional constraint: the sum over i of z_ij <= p_j.
Then, p_j will store the number of ingoing arcs of node j that are used to send flow. Since you will try to minimize the cost of joining arcs (I think), the <= is enough: the minimization will push p_j down to the actual count.
How to create another equation to determine the cost (2 units of time per item) of joining x items from different edges in the sink node?
For this, you can take the value of p_j and multiply it by the predefined cost of joining the flow.
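Putting the pieces together, my reading of the answer is the sketch below; the big-M constant M and writing the join cost as c = 2 per counted arc are assumptions of this sketch, not something stated in the original post.
% Sketch only: linking, counting and cost pieces as described above.
%   M is a big-M constant (an upper bound on the arc flow),
%   c is the per-join cost (c = 2 in the question).
\begin{align}
  y_{ij} &\le M \, z_{ij}                 && \forall (i,j) \in A \\
  \sum_{i : (i,j) \in A} z_{ij} &\le p_j  && \forall j \in V \\
  \text{JoinCost} &= c \sum_{j \in V} p_j
\end{align}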

Related

Classify K-means in Text Mining

The goal is to create a computer-generated news site that aggregates headlines from different news sources around the world:
Looking at the centroid table results, I want to understand the following:
https://ibb.co/n1mvnbk
I used K=5
and I am using TF-IDF
What do those numbers mean?
When an attribute is zero in multiple clusters, what does it mean?
When I sort the centroid table for each cluster in descending order, I find some words or attributes that have a high value in this cluster but zero values in the other clusters. Does this mean that these words occur more or less frequently in this cluster?
How can I discuss the clustering model?
Do all the clusters make sense and why?
Do you think k=5 is a good choice for this dataset, or should I choose 3? How can I determine that?
I believe K=5 denotes the number of clusters you are looking for in the current dataset. On that basis, 5 centroids will be placed and the data will be grouped around them.
Do you think k=5 is a good choice for this dataset? It is hard to predict this way; it comes down to mathematical combination and permutation.
You might use the elbow method to identify the right number of clusters for a given dataset. This methodology is based on WCSS (Within-Cluster Sum of Squares), which measures the distances between the points and their cluster centroids.
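For reference (this formula is not in the original answer, but it is the standard definition), WCSS for clusters C_1, ..., C_K with centroids mu_k is:
% Within-Cluster Sum of Squares for clusters C_1..C_K with centroids mu_k
\mathrm{WCSS}(K) = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^{2}
You run k-means for several values of K, plot WCSS(K) against K, and pick the K at the "elbow" where the curve stops dropping quickly.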
Those numbers are the average tf-idf of the cluster. So a 0 means that the word is not in the cluster, and the highest-valued words are the most characteristic words for the cluster.
Note that for text you'll want to use spherical k-means rather than regular k-means.
Choosing k is a big problem. Forget the elbow method; it never works except on toy examples. Experiment with different k and choose the one that is most convincing or most useful. None of the usual heuristics for choosing the k in k-means will work here, I fear (VRC is IMHO the best). The main reason is that the data cannot be well partitioned into k clusters. There is no reason to assume there are exactly k topics in the world, nor that every document contains only one topic. Instead, the topics themselves form a complex structure. For example there is Trump, but there is also the Trump-Erdogan meeting, and there is the impeachment. These are not disjoint. But you will also have articles that don't fit into any of these topics. This leads to the effect that the true best k would likely be very, very large, as large as the number of articles (and hence not useful).

Aggregation node in TBB

I am new to TBB, so my apologies if this question is obvious... but how do I set up an aggregating node with TBB? Among all the pre-made nodes I cannot find the right type for it.
Imagine I have a stream of incoming images. I want a node that keeps accepting images (with a FIFO buffer), does some calculation on them (i.e. it needs an internal state) and whenever it has received N images (fixed parameter), it emits a single result.
I do not think there is a single node in the TBB flow graph that accumulates inputs with some sort of preprocessing and then, once the accumulation is done, forwards the result to its successor.
However, I believe the effect could be achieved by combining several nodes. For example, consider queue_node as a starting point in the graph. It will serve as a buffer with FIFO semantics. After it there goes a multifunction_node with N outputs. This node will do the actual image preprocessing and send the result to the output port that corresponds to the image number. Then goes a join_node that has all its N inputs connected to the corresponding outputs of the multifunction_node. At the end there will be a successor of the join_node that receives N images as its input. Since join_node aggregates its inputs into a tuple, the drawback of this design becomes apparent when the number N is relatively large.
The other variant might be to have the same queue_node connected to a function_node with unlimited concurrency as its successor (the function_node is supposed to do the per-image preprocessing), and then a multifunction_node with serial concurrency (meaning that only a single instance of its body can be running at a time) that accumulates the images and makes a try_put call from inside the body to its successor once the number N is reached.
Of course there could be other variants of how to implement the desired behavior using other flow graph topologies. By the way, to expose such a graph as a single node one could use composite_node, which represents a subgraph as a single node.
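A minimal sketch of the second variant, assuming hypothetical Image and Result types and N = 4; the preprocessing and combining logic are placeholders, not working image code:
#include <tbb/flow_graph.h>
#include <cstddef>
#include <memory>
#include <tuple>
#include <vector>

struct Image  { /* pixel data ... */ };
struct Result { /* aggregated data ... */ };

int main() {
    using namespace tbb::flow;
    const std::size_t N = 4;

    graph g;

    // FIFO buffer for incoming images.
    queue_node<Image> buffer(g);

    // Per-image preprocessing, allowed to run in parallel.
    function_node<Image, Image> preprocess(g, unlimited,
        [](Image img) { /* ... per-image computation ... */ return img; });

    // Serial accumulator: collects N images, then emits one Result
    // through its single output port. The serial concurrency makes the
    // shared "pending" state safe to touch from the body.
    using accumulator_t = multifunction_node<Image, std::tuple<Result>>;
    auto pending = std::make_shared<std::vector<Image>>();
    accumulator_t accumulate(g, serial,
        [N, pending](const Image& img, accumulator_t::output_ports_type& ports) {
            pending->push_back(img);
            if (pending->size() == N) {
                Result r;                       // ... combine the N images ...
                pending->clear();
                std::get<0>(ports).try_put(r);  // forward one result per N inputs
            }
        });

    // Consumer of each aggregated result.
    function_node<Result> consume(g, serial,
        [](const Result&) { /* ... use the result ... */ return continue_msg(); });

    make_edge(buffer, preprocess);
    make_edge(preprocess, accumulate);
    make_edge(output_port<0>(accumulate), consume);

    for (int i = 0; i < 8; ++i) buffer.try_put(Image{});
    g.wait_for_all();
    return 0;
}
Exact header and namespace names vary a bit between TBB releases (tbb:: vs oneapi::tbb::), so treat this as a sketch rather than copy-paste code.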

C++, determine the part that has the highest zero-crossing rate

I’m not a specialist in signal processing. I’m doing simple processing on a 1D signal using C++. I really want to know how I can determine the part that has the highest zero-crossing rate (the highest frequency). Is there a simple way or method to tell the beginning and the end of this part?
This image illustrates the form of my signal, and this image shows what I need to do (two indices, for the beginning and the end).
Edited:
Actually I have no prior idea about the widths of the beginning and the end; they are quite variable.
I can calculate the number of zero crossings, but I have no idea how to define the range of that part:
// Counts the zero crossings over the whole signal.
int calculateZC(const vector<double>& signals){
    int ZC_counter = 0;
    const int size = static_cast<int>(signals.size());
    for (int i = 0; i < size - 1; i++){
        // A crossing occurs when two consecutive samples have opposite signs.
        if ((signals[i] >= 0 && signals[i+1] < 0) || (signals[i] < 0 && signals[i+1] >= 0)){
            ZC_counter++;
        }
    }
    return ZC_counter;
}
Here is a fairly simple strategy which might give you a starting point. The outline of the algorithm is as follows:
Input: Vector of your data points {y0,y1,...}
Parameters:
Window size sigma.
A threshold 0<p<1 defining when to start looking for a region.
Output: The start- and endpoint {t0,t1} of the region with the most zero-crossings
I won't give any C++ code, but the method should be easy to implement. As an example, let us use the following function.
What we want is the region between roughly 480 and 600, where the zero density is higher than in the front part. The first step of the algorithm is to calculate the positions of the zeros. You can do this with what you already have, but instead of counting, you store the values of i where you meet a zero.
This will give you a list of zero positions
From this list (you can do this directly in the above for-loop!) you create a list having the same size as your input data which looks like {0,0,0,...,1,0,..,1,0,..}. Every zero-crossing position in your input data is marked with a 1.
The next step is to smooth this list with a smoothing filter of size sigma. Here, you can use whatever you like; in the simplest case a moving average or a Gaussian filter. The larger you choose sigma, the bigger the look-around window that measures how many zero-crossings are around a certain point. Let me show the output of this filter together with the original zero positions. Note that I used a Gaussian filter of size 10 here.
In the next step, you go through the filtered data and find the maximum value. In this case it is about 0.15. Now you choose your second parameter, which is some percentage of this maximum. Let's say p = 0.6.
The final step is to go through the filtered data, and when the value is greater than the threshold (p times the maximum) you start remembering a new region. As soon as the value drops below the threshold, you end this region and remember its start and end point. Once you are finished walking through the data, you are left with a list of regions, each defined by a start and an end point. Now you choose the region with the biggest extent and you are done.
(Optionally, you could add the filter size to each end of the final region)
For the above example, I get 11 regions as follows
{{164,173},{196,205},{220,230},{241,252},{259,271},{278,290},
{297,309},{318,327},{341,350},{458,468},{476,590}}
where the one with the biggest extent is the last one, {476,590}. The final result looks like this (with 1/2 filter-size padding added at each end):
Conclusion
Please don't be discouraged by the length of my answer. I tried to explain everything in detail. The implementation is really just some loops:
one loop to create the zero-crossings list {0,0,..,1,0,...}
one nested loop for the moving average filter (or you use some library Gaussian filter). Here you can at the same time extract the maximum value
one loop to extract all regions
one loop to extract the largest region if you haven't already extracted it in the above step
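Here is a rough C++ sketch of those loops, using a plain moving average instead of the Gaussian filter; the function name, sigma and p follow the outline above and are just placeholders:
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Returns the start and end index of the widest region whose smoothed
// zero-crossing density exceeds p times the maximum density.
std::pair<std::size_t, std::size_t>
findDensestRegion(const std::vector<double>& y, std::size_t sigma, double p)
{
    const std::size_t n = y.size();

    // 1) Mark every zero-crossing position with a 1.
    std::vector<double> marks(n, 0.0);
    for (std::size_t i = 0; i + 1 < n; ++i)
        if ((y[i] >= 0 && y[i+1] < 0) || (y[i] < 0 && y[i+1] >= 0))
            marks[i] = 1.0;

    // 2) Smooth the marks with a moving average of window size 2*sigma+1
    //    and remember the maximum of the filtered signal.
    std::vector<double> smooth(n, 0.0);
    double maxVal = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t lo = (i > sigma) ? i - sigma : 0;
        const std::size_t hi = std::min(n - 1, i + sigma);
        double sum = 0.0;
        for (std::size_t j = lo; j <= hi; ++j) sum += marks[j];
        smooth[i] = sum / double(hi - lo + 1);
        if (smooth[i] > maxVal) maxVal = smooth[i];
    }

    // 3) Extract the regions above p * max and keep the widest one.
    const double threshold = p * maxVal;
    std::pair<std::size_t, std::size_t> best{0, 0};
    bool inRegion = false;
    std::size_t start = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (!inRegion && smooth[i] > threshold) { inRegion = true; start = i; }
        else if (inRegion && smooth[i] <= threshold) {
            inRegion = false;
            if (i - 1 - start > best.second - best.first) best = {start, i - 1};
        }
    }
    if (inRegion && n - 1 - start > best.second - best.first) best = {start, n - 1};
    return best;   // optionally pad each end by sigma, as mentioned above
}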

Finding min distance between several points, or merging querysets with extra selects

I have a bunch of objects with location attributes (PointFields). I have two special locations, and I want to know which of those locations each object is closest to and how far that is. That is, I'd like to do something like:
q0 = q.distance(p0).extra(select={'dist_from': p0})
q1 = q.distance(p1).extra(select={'dist_from': p1})
qq = take_obj_with_min_distance(q0, q1)
(The actual query will do some stuff with bboverlaps and location__distance_lt, possibly involve more than two special locations, and possibly objects with multiple location attributes. Nevertheless, I think a solution to the above will handle all that other stuff.)
Afterwards, qq should have the same elements as q, but each element should have a distance attribute and a dist_from attribute, where distance is the minimum of the distances from p0 and p1, and dist_from is the point that achieves that minimum.
Can I do this? Is it healthy for children and other living things?
I considered merging the queries and doing this stuff with a list, but of course you can't merge queries with extra select values (such as are introduced by distance queries). Also, I'll want to filter qq some more afterwards.
This page will give you the required code in a bunch of languages: http://www.codecodex.com/wiki/Calculate_Distance_Between_Two_Points_on_a_Globe
If you don't have too many objects you might wish to do it in Python, but if you prefer querying the database it might be best to prepare a procedure or function in SQL.
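For reference, the great-circle distance that page collects in many languages is the haversine formula; a rough C++ version (inputs in degrees, result in kilometres, Earth radius assumed to be 6371 km) might look like this:
#include <cmath>

// Haversine distance between two latitude/longitude pairs given in degrees.
double greatCircleKm(double lat1, double lon1, double lat2, double lon2)
{
    const double kPi = 3.14159265358979323846;
    const double kEarthRadiusKm = 6371.0;          // mean Earth radius (assumption)
    const double toRad = kPi / 180.0;
    const double dLat = (lat2 - lat1) * toRad;
    const double dLon = (lon2 - lon1) * toRad;
    const double a = std::sin(dLat / 2) * std::sin(dLat / 2) +
                     std::cos(lat1 * toRad) * std::cos(lat2 * toRad) *
                     std::sin(dLon / 2) * std::sin(dLon / 2);
    return 2.0 * kEarthRadiusKm * std::asin(std::sqrt(a));
}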

Clustering algorithm with upper bound requirement for each cluster size

I need to partition approximately 50,000 points into distinct clusters. There is one requirement: the size of every cluster cannot exceed K. Is there any clustering algorithm that can do this job?
Please note that the upper bound K is the same for every cluster, say 100.
Most clustering algorithms can be used to create a tree in which the lowest level contains single elements - either because they naturally work "bottom up" by joining pairs of elements and then groups of joined elements, or because - like K-Means - they can be used to repeatedly split groups into smaller groups.
Once you have a tree, you can decide where to split off subtrees to form your clusters of size <= 100. Pruning an existing tree is often quite easy. Suppose that you want to divide an existing tree to minimise the sum of some cost of the clusters you create. You might have:
f(tree-node, list_of_clusters)
{
    cost = infinity;
    if (size of tree below tree-node <= 100)
    {
        cost = cost_function(stuff below tree-node);
    }
    temp_list = new List();
    cost_children = 0;
    for (children of tree_node)
    {
        cost_children += f(child, temp_list);
    }
    if (cost_children < cost)
    {
        list_of_clusters.add_all(temp_list);
        return cost_children;
    }
    list_of_clusters.add(tree_node);
    return cost;
}
One way is to use hierarchical K-means, where you keep splitting each cluster that is larger than K until all of them are small enough.
Another (in some sense opposite) approach would be to use hierarchical agglomerative clustering, i.e. a bottom-up approach, and again make sure you don't merge clusters if they would form a new one of size > K.
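A minimal sketch of the splitting idea; here splitIntoTwo() just splits at the median x-coordinate for brevity, where real code would run k-means with k = 2 (or any other bisecting step), and the Point type is a placeholder:
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

struct Point { double x, y; };

// Stand-in for a 2-means step: split the group at the median x value.
std::pair<std::vector<Point>, std::vector<Point>>
splitIntoTwo(std::vector<Point> group)
{
    const std::size_t mid = group.size() / 2;
    std::nth_element(group.begin(), group.begin() + mid, group.end(),
                     [](const Point& a, const Point& b) { return a.x < b.x; });
    std::vector<Point> left(group.begin(), group.begin() + mid);
    std::vector<Point> right(group.begin() + mid, group.end());
    return {left, right};
}

// Recursively bisect any group larger than K; every accepted cluster
// ends up with at most K points.
void splitUntilSmall(const std::vector<Point>& group, std::size_t K,
                     std::vector<std::vector<Point>>& clusters)
{
    if (group.size() <= K) {              // small enough: accept as a cluster
        clusters.push_back(group);
        return;
    }
    auto halves = splitIntoTwo(group);    // otherwise bisect and recurse
    splitUntilSmall(halves.first,  K, clusters);
    splitUntilSmall(halves.second, K, clusters);
}
Calling splitUntilSmall(points, 100, clusters) leaves every cluster with at most 100 points, though, as with any top-down scheme, nothing guarantees the resulting partition is optimal.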
The issue with naive clustering is that you do indeed have to calculate a distance matrix that holds the distance of each member from every other member in the set. It depends on whether you've pre-processed the population, or whether you're amalgamating the clusters into typical individuals and then recalculating the distance matrix again.