Comparing ordinal values in data mining

I have several variables of ordinal type, with the ordering
never < rarely < occasionally < often
I would like to calculate how close two such variables are. Is that possible?
Example 1: v1 = often, v2 = often
so v1 and v2 are as close as they can be.
Example 2: v1 = occasionally, v2 = often
Example 3: v1 = rarely, v2 = often
The values of v1 and v2 in Example 2 are closer together than those in Example 3.
How can I express their degree of closeness as a number?

Pseudocode:
let never = 0, rarely = 1, occasionally = 2, often = 3
result = absolute_value(v1 - v2) //where result=0 is the closest and result=3 is the farthest
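The pseudocode above can be sketched in Python as follows. The rank encoding is taken from the answer; the `ordinal_similarity` helper, which rescales the distance to the range [0, 1], is a hypothetical addition and assumes the levels are equally spaced, which ordinal data does not guarantee.

```python
# Ordinal levels in ascending order; the integer ranks follow the pseudocode above.
LEVELS = ["never", "rarely", "occasionally", "often"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def ordinal_distance(v1, v2):
    """Absolute rank difference: 0 is closest, len(LEVELS) - 1 is farthest."""
    return abs(RANK[v1] - RANK[v2])


def ordinal_similarity(v1, v2):
    """Hypothetical helper: rescale the distance to [0, 1], where 1 means identical."""
    return 1.0 - ordinal_distance(v1, v2) / (len(LEVELS) - 1)


if __name__ == "__main__":
    for a, b in [("often", "often"), ("occasionally", "often"), ("rarely", "often")]:
        print(a, b, ordinal_distance(a, b), ordinal_similarity(a, b))
```

Treating the ranks as equally spaced numbers is a modelling choice; if the gaps between levels are not comparable, a different encoding may fit better.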

How to draw b-spline curve using this math algorithm

I have to use this formula to draw a third-degree B-spline curve.
Can someone tell me what I am doing wrong in my code? It doesn't work properly, and I get weird results when I try to draw the curve.
segment is a vector of QPoint; each point has x and y.
void MyWindow::calculateCurve() {
    QPoint result;
    int m = segment.size();
    int from = m-3;
    int to = m-2;
    for(double t = 0.0; t<=to; t+=0.001){
        result = (pow(-t, 3)+3*pow(t,2)+1)/6*(segment[segment.size()-3])+
                 (3*pow(t,3)-6*pow(t,2)+4)/6*(segment[segment.size()-2])+
                 (pow(-3*t,3)+3*pow(t,2)+3*t+1)/6*(segment[segment.size()-1]) +
                 (pow(t,3)/6)*(segment[segment.size()]);
        draw(result.x(), result.y());
    }
}
Most often we define a common range for the parameter t (i.e. for
the whole curve, not for each segment separately). We can
e.g. assume that t ∈ [0, m - 2]. Then, for the segment Q_3,
parameter t varies from t_3 = 0 to t_4 = 1, for segment
Q_4 from t_4 = 1 to t_5 = 2, and for the last segment Q_m from
t_m = m - 3 to t_(m+1) = m - 2.
You have written pow(-3*t,3), which means (-3t)³, but you should have written -3*pow(t,3), that is, -3(t³):
result = (pow(-t, 3)+3*pow(t,2)+1)/6*(segment[segment.size()-3])+
(3*pow(t,3)-6*pow(t,2)+4)/6*(segment[segment.size()-2])+
(-3*pow(t,3)+3*pow(t,2)+3*t+1)/6*(segment[segment.size()-1]) +
(pow(t,3)/6)*(segment[segment.size()])
;
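For comparison, here is a minimal Python sketch of evaluating a uniform cubic B-spline. The formula from the question is not reproduced in the post, so the sketch assumes the standard uniform cubic basis, in which the first weight is (1 - t)^3 / 6, that is (-t^3 + 3t^2 - 3t + 1) / 6. It also assumes that t is local to each segment and runs over [0, 1], and that segment i uses control points i to i + 3, so no index reaches past the end of the point list. The function names are hypothetical.

```python
def bspline_basis(t):
    """Standard uniform cubic B-spline weights for a local parameter t in [0, 1]."""
    b0 = (1 - t) ** 3 / 6                              # (-t^3 + 3t^2 - 3t + 1) / 6
    b1 = (3 * t ** 3 - 6 * t ** 2 + 4) / 6
    b2 = (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6    # note: -3 * t^3, not (-3t)^3
    b3 = t ** 3 / 6
    return b0, b1, b2, b3


def segment_point(p0, p1, p2, p3, t):
    """Point on the segment controlled by four consecutive (x, y) points."""
    weights = bspline_basis(t)
    points = (p0, p1, p2, p3)
    x = sum(w * p[0] for w, p in zip(weights, points))
    y = sum(w * p[1] for w, p in zip(weights, points))
    return x, y


def curve_points(control, steps=20):
    """Sample every segment; segment i uses control[i] .. control[i + 3]."""
    samples = []
    for i in range(len(control) - 3):
        for k in range(steps + 1):
            samples.append(segment_point(*control[i:i + 4], k / steps))
    return samples


if __name__ == "__main__":
    control = [(0, 0), (1, 2), (3, 3), (4, 0), (6, 1)]
    for x, y in curve_points(control, steps=4):
        print(round(x, 3), round(y, 3))
```

The four weights sum to 1 for every t, which is a quick sanity check for any hand-typed version of the basis.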

mosek Model.getSolverIntInfo returns 0

I'm following Mosek documentation to retrieve the information of the solver. In particular I want to get the number of constraints by:
Model::t M = new Model("cqo1");
Variable::t x = M->variable("x", 3, Domain::greaterThan(0.0));
Variable::t y = M->variable("y", 3, Domain::unbounded());
Variable::t z1 = Var::vstack(y->index(0), x->slice(0, 2));
Variable::t z2 = Var::vstack(y->slice(1, 3), x->index(2));
auto aval = new_array_ptr<double, 1>({1.0, 1.0, 2.0});
M->constraint("lc", Expr::dot(aval, x), Domain::equalsTo(1.0));
Constraint::t qc1 = M->constraint("qc1", z1, Domain::inQCone());
Constraint::t qc2 = M->constraint("qc2", z2, Domain::inRotatedQCone());
M->objective("obj", ObjectiveSense::Minimize, Expr::sum(y));
int anaProNumCon = M->getSolverIntInfo("anaProNumCon");
However, it returns anaProNumCon = 0 (it should be 3). What could be wrong with the call?
Information items are only set after you have called M->solve() (see https://docs.mosek.com/latest/cxxfusion/solver-infitems.html), and the ones from the problem analyzer are probably not set until you have called the problem analyzer, which is available only in the Optimizer API, not in Fusion.
Moreover, since your conic constraints are multi-dimensional, the actual number of constraints that would be returned at this point is not 3, but something like 1 (lc) + 3 (qc1 slacks) + 3 (qc2 slacks) = 7 if I am not mistaken.
What I'm trying to say is that this is not a meaningful way of finding out about the problem. The information items always relate to the low-level optimizer task, but you would have to know how the Fusion model is mapped to that low-level task, and that mapping is not part of the API guaranteed by Mosek.
You can call M->writeTask("file.opf") to see what the low-level model looks like. On the other hand, if you want to know how many Fusion constraints your Fusion model has, then you have to keep track of that in your own code.

Product of a multi-dimensional array (or tensor) and vectors

I would like to ask for a fast way to perform the following operations, either in native Matlab, C++, or using toolboxes/libraries, whichever would give the fastest solutions.
Let M be a tensor of D dimensions: n1 x n2 x... x nD, and let v1, v2,..., vD be D vectors whose dimensions are respectively n1, n2,..., nD.
Compute the product M*vi (1 <= i <= D). The result is a multi-dimensional array of (D-1) dimensions.
Compute the product of M with all vectors, except vi.
For example, with D = 3:
The product of M and v1 is a tensor N of 2 dimensions (i.e. a matrix) where
N[i2][i3] = Sum_over_i1 of M[i1][i2][i3]*v1[i1]
The product of M and v2 is a matrix N where
N[i1][i3] = Sum_over_i2 of M[i1][i2][i3]*v2[i2]
The product of M and v2 and v3 is a vector v where
v[i1] = Sum_over_i2 of (Sum_over_i3 of M[i1][i2][i3]*v2[i2]*v3[i3])
A further question: the above but for sparse tensors.
An example of Matlab code is given below.
Thank you very much in advance for your help!!
n1 = 3;
n2 = 5;
n3 = 4;
M = randn(n1,n2,n3);
v1 = randn(n1,1);
v2 = randn(n2,1);
v3 = randn(n3,1);
%% N = M*v2
N = zeros(n1,n3);
for i1=1:n1
for i3=1:n3
for i2=1:n2
N(i1,i3) = N(i1,i3) + M(i1,i2,i3)*v2(i2);
end
end
end
%% v = M*v2*v3
v = zeros(n1,1);
for i1=1:n1
for i2=1:n2
for i3=1:n3
v(i1) = v(i1) + M(i1,i2,i3)*v2(i2)*v3(i3);
end
end
end
The operation you are describing takes the (D - 1)-dimensional slices of M, scales each one by the corresponding entry of vi, and then sums the results over the index of vi. This code seems to work for getting N in your example:
N2 = squeeze(sum(M.*(v2)', 2));
To get v in your code, all you need to do is multiply N by v3:
v2 = N2*v3;
EDIT
On older versions of MATLAB (before R2016b, which introduced implicit expansion), the element-wise operator .* doesn't work the way I've used it above. One alternative is bsxfun:
N2 = squeeze(sum(bsxfun(@times, M, v2'), 2));
Just checked: In terms of performance, the bsxfun way seems as fast as the .* way for large arrays, at least on R2016b.
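The sparse case asked about in the question is not covered above. One possible approach, sketched here in plain Python, is to store only the nonzero entries and loop over them once. The dictionary layout (index tuple to value), the zero-based indices, and the function names are all assumptions made for illustration; a dedicated sparse tensor library would likely be faster.

```python
from collections import defaultdict


def contract_mode(tensor, vec, mode):
    """Contract a sparse tensor with `vec` along axis `mode`.

    `tensor` maps zero-based index tuples to nonzero values. The result is a
    dict of the same kind whose keys have axis `mode` removed.
    """
    out = defaultdict(float)
    for idx, val in tensor.items():
        key = idx[:mode] + idx[mode + 1:]
        out[key] += val * vec[idx[mode]]
    return dict(out)


def contract_all_but(tensor, vecs, keep):
    """Contract with every vector in `vecs` except the one for axis `keep`.

    Returns a dict mapping the index along axis `keep` to the summed value.
    """
    out = defaultdict(float)
    for idx, val in tensor.items():
        weight = val
        for axis, i in enumerate(idx):
            if axis != keep:
                weight *= vecs[axis][i]
        out[idx[keep]] += weight
    return dict(out)


if __name__ == "__main__":
    M = {(0, 0, 0): 1.0, (0, 1, 1): 2.0, (1, 1, 0): 3.0}   # 2 x 2 x 2, three nonzeros
    v2 = [1.0, 10.0]
    v3 = [1.0, 0.5]
    print(contract_mode(M, v2, 1))                  # N[i1][i3] as a dict
    print(contract_all_but(M, [None, v2, v3], 0))   # v[i1] as a dict
```

Both functions do work proportional to the number of stored nonzeros, independent of the full size n1 x n2 x ... x nD.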

Incremental entropy computation

Let std::vector<int> counts be a vector of positive integers and let N := counts[0] + ... + counts[counts.size()-1] be the sum of its components. Setting pi := counts[i]/N, I compute the entropy using the classic formula H = -(p0*log2(p0) + ... + pn*log2(pn)).
The counts vector is changing --- counts are incremented --- and every 200 changes I recompute the entropy. After a quick google and stackoverflow search I couldn't find any method for incremental entropy computation. So the question: Is there an incremental method, like the ones for variance, for entropy computation?
EDIT: Motivation for this question was usage of such formulas for incremental information gain estimation in VFDT-like learners.
Resolved: See this mathoverflow post.
I derived update formulas and algorithms for entropy and Gini index and made the note available on arXiv. (The working version of the note is available here.) Also see this mathoverflow answer.
For the sake of convenience I am including simple Python code, demonstrating the derived formulas:
from math import log
from random import randint

# maps x to -x*log2(x) for x>0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0

# update entropy H of a sample with sum S if a new count x comes in
def update(H, S, x):
    new_S = S+x
    return 1.0*H*S/new_S+h(1.0*x/new_S)+h(1.0*S/new_S)

# entropy of union of two samples with entropies H1 and H2
def update_union(H1, S1, H2, S2):
    S = S1+S2
    return 1.0*H1*S1/S+h(1.0*S1/S)+1.0*H2*S2/S+h(1.0*S2/S)

# compute entropy(L) using only the `update` function
def test(L):
    S = 0.0 # sum of the sample elements
    H = 0.0 # sample entropy
    for x in L:
        H = update(H, S, x)
        S = S+x
    return H

# compute entropy using the classic equation
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])

# entry point
if __name__ == "__main__":
    L = [randint(1,100) for k in range(100)]
    M = [randint(100,1000) for k in range(100)]
    L_ent = entropy(L)
    L_sum = sum(L)
    M_ent = entropy(M)
    M_sum = sum(M)
    T = L+M
    print("Full = ", entropy(T))
    print("Update = ", update_union(L_ent, L_sum, M_ent, M_sum))
You could re-compute the entropy from the updated counts, using a simple mathematical identity to simplify the entropy formula:
K = count.size();
N = count[0] + ... + count[K - 1];
H = -(count[0]/N * log2(count[0]/N) + ... + count[K - 1]/N * log2(count[K - 1]/N))
= log2(N) - h / N
h = count[0] * log2(count[0]) + ... + count[K - 1] * log2(count[K - 1])
which holds because log2(a / b) == log2(a) - log2(b) and the counts sum to N.
Now, given the vector count of observations so far and another vector of 200 new observations called batch, you can do the following in C++11:
#include <cmath>
#include <numeric>
#include <vector>

void update_H(double& H, std::vector<int>& count, int& N, std::vector<int> const& batch)
{
    N += static_cast<int>(batch.size());
    for (auto b: batch)
        ++count[b];
    // h = sum of c * log2(c); empty bins contribute 0
    double h = std::accumulate(count.begin(), count.end(), 0.0, [](double acc, int c) {
        return c > 0 ? acc + c * std::log2(c) : acc;
    });
    H = std::log2(N) - h / N;
}
Here I assume that you have encoded your observations as int. If you have some kind of symbol, you would need a symbol table std::map<Symbol, int>, and do a lookup for each symbol in batch before you update the count.
This seems the quickest way of writing some code for a general update. If you know that in every batch only a few counts actually change, you can do as @migdal does and keep track of the changing counts, subtracting their old contribution to the entropy and adding the new contribution.
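The idea of subtracting a count's old contribution and adding its new one can be sketched as follows. The sketch relies on the identity H = log2(N) - T / N, where T is the sum of c * log2(c) over all counts c, so only N and T need to be kept up to date. The class and method names are hypothetical, and the running sum T may accumulate floating-point error over very many updates, so an occasional full recomputation may be worthwhile.

```python
from math import log2


def _term(c):
    """Contribution c * log2(c) of one count, taken as 0 for an empty bin."""
    return c * log2(c) if c > 0 else 0.0


class RunningEntropy:
    """Keeps N = sum(c) and T = sum(c * log2(c)) so that H = log2(N) - T / N."""

    def __init__(self, num_bins):
        self.counts = [0] * num_bins
        self.N = 0
        self.T = 0.0

    def increment(self, i, by=1):
        """Add `by` observations to bin i in O(1)."""
        old = self.counts[i]
        new = old + by
        self.T += _term(new) - _term(old)   # swap the old contribution for the new one
        self.counts[i] = new
        self.N += by

    def entropy(self):
        return log2(self.N) - self.T / self.N if self.N > 0 else 0.0


def direct_entropy(counts):
    """Classic formula, used here only to check the running value."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)


if __name__ == "__main__":
    r = RunningEntropy(3)
    for i in (0, 1, 2, 2):
        r.increment(i)
    print(r.entropy(), direct_entropy(r.counts))
```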

Variation on set cover problem in R / C++

Given a universe of elements U = {1, 2, 3,...,n} and a number of sets in this universe {S1, S2,...,Sm}, what is the smallest set we can create that will cover at least one element in each of the m sets?
For example, given the following elements U = {1,2,3,4} and sets S = {{4,3,1},{3,1},{4}}, the following sets will cover at least one element from each set:
{1,4}
or
{3,4}
so the minimum size of such a set here is 2.
Any thoughts on how this can be scaled up to solve the problem for m=100 or m=1000 sets? Or thoughts on how to code this up in R or C++?
The sample data, from above, using R's library(sets).
s1 <- set(4, 3, 1)
s2 <- set(3, 1)
s3 <- set(4)
s <- set(s1, s2, s3)
Cheers
This is the hitting set problem, which is basically set cover with the roles of elements and sets interchanged. Letting A = {4, 3, 1} and B = {3, 1} and C = {4}, the element-set containment relation is
A B C
1 + + -
2 - - -
3 + + -
4 + - +
so you basically want to solve the problem of covering {A, B, C} with sets 1 = {A, B} and 2 = {} and 3 = {A, B} and 4 = {A, C}.
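The role swap described above can be sketched in a few lines of Python: each element becomes a "set" containing the names of the original sets it belongs to. The function name and the dictionary representation are assumptions made for illustration.

```python
def dual_instance(sets, universe):
    """Map every element of `universe` to the names of the sets that contain it."""
    dual = {e: set() for e in universe}
    for name, members in sets.items():
        for e in members:
            dual[e].add(name)
    return dual


if __name__ == "__main__":
    sets = {"A": {4, 3, 1}, "B": {3, 1}, "C": {4}}
    for element, names in sorted(dual_instance(sets, {1, 2, 3, 4}).items()):
        print(element, sorted(names))
```

A set cover of {A, B, C} using the resulting sets corresponds directly to a hitting set of the original instance.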
Probably the easiest way to solve nontrivial instances of set cover in practice is to find an integer programming package with an interface to R or C++. Your example would be rendered as the following integer program, in LP format.
Minimize
obj: x1 + x2 + x3 + x4
Subject To
A: x1 + x3 + x4 >= 1
B: x1 + x3 >= 1
C: x4 >= 1
Binary
x1 x2 x3 x4
End
At first I misunderstood the complexity of the problem and came up with a function that finds a set that covers the m sets - but I then realized that it isn't necessarily the smallest one:
cover <- function(sets, elements = NULL) {
  if (is.null(elements)) {
    # Build the union of all sets
    su <- integer()
    for(si in sets) su <- union(su, si)
  } else {
    su <- elements
  }
  s <- su
  for(i in seq_along(s)) {
    # create set candidate with one element removed
    sc <- s[-i]
    ok <- TRUE
    for(si in sets) {
      if (!any(match(si, sc, nomatch=0L))) {
        ok <- FALSE
        break
      }
    }
    if (ok) {
      s <- sc
    }
  }
  # The resulting set
  s
}
sets <- list(s1=c(1,3,4), s2=c(1,3), s3=c(4))
> cover(sets) # [1] 3 4
Then we can time it:
n <- 100 # number of elements
m <- 1000 # number of sets
sets <- lapply(seq_len(m), function(i) sample.int(n, runif(1, 1, n)))
system.time( s <- cover(sets) ) # 0.53 seconds
Not too bad, but still not the smallest one.
The obvious solution: generate all permutations of the elements, pass each one to the cover function, and keep the smallest result. This will take close to "forever".
Another approach is to generate a limited number of random permutations - this way you get an approximation of the smallest set.
ns <- 10 # number of samples
elements <- seq_len(n)
smin <- sets
for(i in seq_len(ns)) {
s <- cover(sets, sample(elements))
if (length(s) < length(smin)) {
smin <- s
}
}
length(smin) # approximate smallest length
If you restrict each set to have 2 elements, you have the NP-complete problem vertex cover (also called node cover). I would guess the more general problem is also NP-complete (for the exact version).
If you're just interested in an algorithm (rather than an efficient/feasible algorithm), you can simply generate subsets of the universe of increasing cardinality and check that the intersection with all the sets in S is non-empty. Stop as soon as you get one that works; the cardinality is the minimum possible.
The complexity of this is 2^|U| in the worst case, I think. Given Foo Bah's answer, I don't think you're going to get a polynomial-time answer...
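The exhaustive search described above can be sketched in Python as follows. It is only practical for small universes, since it may examine every subset; the function name is hypothetical.

```python
from itertools import combinations


def min_hitting_set(universe, sets):
    """Smallest subset of `universe` that intersects every set in `sets`.

    Tries candidate subsets in order of increasing size and returns the first
    one that works, or None if no subset works (e.g. one of the sets is empty).
    """
    sets = [set(s) for s in sets]
    elements = sorted(universe)
    for k in range(len(elements) + 1):
        for candidate in combinations(elements, k):
            chosen = set(candidate)
            if all(chosen & s for s in sets):
                return chosen
    return None


if __name__ == "__main__":
    print(min_hitting_set({1, 2, 3, 4}, [{4, 3, 1}, {3, 1}, {4}]))
```

Because candidates are tried smallest first, the first hit is guaranteed to have minimum cardinality, though other sets of the same size may exist.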