Selecting a subset of sequences so that their sum has minimum variance?

I have a set of N real sequences and need to pick K sequences (with no replacement) such that their sum has the minimum variance.
E.g. I have N=3 real sequences of length 5:
x(1)=[-0.9 0.7 2.0 2.5 1.5]
x(2)=[-1.8 -0.2 0.5 -1.3 -0.7]
x(3)=[-1.5 -0.9 0.3 1.5 0.4]
If I need to select K=2 sequences, the variances of the possible sums are:
var(x(1)+x(2))=3.7
var(x(1)+x(3))=6.1
var(x(2)+x(3))=2.5
So I'd want to select sequences 2 & 3.
This is easy to brute force for small N, but my real application has a much larger N. For example, for N=20 and K=10, there are 184756 combinations. Since my sequences are long and computation time is critical, this is not feasible.
Is there an efficient algorithm to do the selection? Or even to give an approximate solution? Or reduce the problem space to likely candidates?
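One observation that may help, whatever search strategy you end up using: the variance of a subset sum can be computed from the covariance matrix alone, since Var(sum over i in S of x(i)) equals the sum over all pairs i, j in S of Cov(x(i), x(j)). So the long sequences only have to be scanned once to build the N-by-N covariance matrix; after that, each candidate subset costs O(K^2) instead of a pass over the data. Here is a small numpy sketch of that idea reproducing the example above (the brute-force enumeration over combinations is still there, so this only removes the dependence on sequence length):

from itertools import combinations
import numpy as np

# The three length-5 sequences from the example.
X = np.array([[-0.9,  0.7, 2.0,  2.5,  1.5],
              [-1.8, -0.2, 0.5, -1.3, -0.7],
              [-1.5, -0.9, 0.3,  1.5,  0.4]])
K = 2

C = np.cov(X)   # N x N sample covariance matrix, built in one pass over the data
best = min(combinations(range(len(X)), K),
           key=lambda S: C[np.ix_(S, S)].sum())   # variance of the subset sum = sum of the KxK block
print(best, C[np.ix_(best, best)].sum())          # (1, 2) and about 2.53, i.e. sequences 2 & 3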


Poisson distribution or Normal distribution

If I need to generate random numbers in the range [N, M], but with more of them close to some average avg (N <= avg <= M), which is better to use:
poisson_distribution or
normal_distribution?
Looking at the examples on the cppreference pages (at the bottom of each page), they both seem to generate what is needed:
poisson_distribution with mean 4:
0 *
1 *******
2 **************
3 *******************
4 *******************
5 ***************
6 **********
7 *****
8 **
9 *
10
11
12
13
normal_distribution with mean 5 and standard deviation 2:
-2
-1
0
1 *
2 ***
3 ******
4 ********
5 **********
6 ********
7 *****
8 ***
9 *
10
11
12
What should I choose? Maybe something else?
Neither choice is great if you need the outcomes on a bounded range. The normal distribution has infinite tails at both ends, the Poisson distribution has an infinite upper tail. At a minimum you'd want a truncated form of one of them. If you're not truncating, note that the normal is always symmetric about its mean while a Poisson can be quite skewed. The two distributions also differ in the fact that the normal is continuous, the Poisson is discrete, although you can discretize continuous distributions by binning the results.
If you want a discrete set of outcomes on a bounded range, you could try a scaled and shifted binomial distribution. A binomial with parameters n and p counts how many "successes" you get out of n trials when the trials are independent and all yield success with probability p. Make n = M - N and shift the outcome by N to get outcomes in the range [N,M].
If you want a continuous range of outcomes, consider a beta distribution. You can fudge the parameters to get a wide variety of distribution shapes and dial in the mean to where you want it, and scale+shift it to any range you want.
You can center both distributions in a point that suits your needs.
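For what it's worth, here is a quick numpy sketch of the two suggestions above. The bounds N, M and the target mean avg are made-up values; in C++ the discrete case maps to std::binomial_distribution, and a beta variate can be built from two std::gamma_distribution draws as X/(X+Y).

import numpy as np
rng = np.random.default_rng()

N, M, avg = 10, 30, 18            # hypothetical bounds and desired mean

# Discrete: shifted binomial on [N, M]; its mean is N + (M - N)*p, so choose p accordingly.
p = (avg - N) / (M - N)
discrete = N + rng.binomial(M - N, p, size=100000)

# Continuous: scaled and shifted beta on [N, M]; its mean is N + (M - N)*a/(a + b).
a = 2.0
b = a * (M - avg) / (avg - N)     # picked so the mean lands on avg; larger a and b give a narrower peak
continuous = N + (M - N) * rng.beta(a, b, size=100000)

print(discrete.mean(), continuous.mean())   # both close to avg = 18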
But if M is small, then the Poisson distribution has a 'fat tail', that is, the probability of getting a number above M is higher compared to the normal distribution.
In the normal case, you can control this chance via the variance parameter (it can be as small as you want).
The other, rather obvious, difference is that a Poisson will only give you non-negative integers, whereas a normal distribution will give you any real value.
Plus, when the mean is large enough, the Poisson converges to a normal distribution. So even if the Poisson is the right model, the normal approximation won't be that inaccurate.
With this in mind, if the numbers do not simulate a counting process, I would go for the Normal.
If you need a distribution that lives on a bounded range (not an infinite or semi-infinite one like the normal or Poisson) but has a clear peak, you may try the Irwin-Hall distribution with several degrees of freedom. Say, IH(16) has its minimum at 0, maximum at 16 and peak at 8; see http://en.wikipedia.org/wiki/Irwin%E2%80%93Hall_distribution
It is very easy to sample, easy to scale, and you can play with n to make the peak wider or narrower.
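A minimal sketch of that suggestion, assuming the [N, M] range from the question (scaling the Irwin-Hall sum puts the peak at the midpoint of the range; n controls how concentrated it is):

import random

def irwin_hall_scaled(N, M, n=16):
    s = sum(random.random() for _ in range(n))   # Irwin-Hall(n): sum of n U(0,1) draws, supported on [0, n]
    return N + (M - N) * s / n                   # rescaled onto [N, M], peaked at (N + M)/2

samples = [irwin_hall_scaled(10, 30) for _ in range(10)]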
I prefer the normal distribution because it is closer to real-life problems, while the Poisson distribution is used for special cases only. Choosing the normal distribution keeps your problem more general.

My neural net learns sin x but not cos x

I have built my own neural net and I have a weird problem with it.
The net is quite a simple feed-forward 1-N-1 net with back-propagation learning. Sigmoid is used as the activation function.
My training set is generated from random values in [-PI, PI] together with their sine values scaled to [0, 1] (this is because the sigmoid net produces only values in [0, 1], while the unscaled sine function produces values in [-1, 1]).
With that training set, and the net set to 1-10-1 with a learning rate of 0.5, everything works great and the net learns the sine function as it should. BUT... if I do everything exactly the same way for the cosine function, the net won't learn it. Not with any setup of hidden layer size or learning rate.
Any ideas? Am I missing something?
EDIT: My problem seems to be similar to what can be seen with this applet. It won't seem to learn the sine function unless something "easier" is taught first (like 1400 cycles of a quadratic function). All the other settings in the applet can be left as they initially are. So in the case of sine or cosine it seems that the weights need some boosting in at least partially the right direction before a solution can be found. Why is this?
I'm struggling to see how this could work.
You have, as far as I can see, 1 input, N nodes in 1 layer, then 1 output. So there is no difference between any of the nodes in the hidden layer of the net. Suppose you have an input x, and a set of weights wi. Then the output node y will have the value:
y = Σ_i (w_i · x) = x · Σ_i w_i
So this is always linear.
In order for the nodes to be able to learn differently, they must be wired differently and/or have access to different inputs. So you could supply inputs of the value, the square root of the value (giving some effect of scale), etc and wire different hidden layer nodes to different inputs, and I suspect you'll need at least one more hidden layer anyway.
The neural net is not magic. It produces a set of specific weights for a weighted sum. Since you can derive a set of weights to approximate a sine or cosine function, that must inform your idea of what inputs the neural net will need in order to have some chance of succeeding.
An explicit example: the Taylor series of the exponential function is:
exp(x) = 1 + x/1! + x^2/2! + x^3/3! + x^4/4! ...
So if you supplied 6 input nodes with 1, x, x^2, etc., then a neural net that just passed each input to one corresponding node, multiplied it by its weight, and then fed all those outputs to the output node would be capable of the 6-term Taylor expansion of the exponential:
 in     hid  out
 1   -- h0 --\
 x   -- h1 ---\
 x^2 -- h2 ----\
 x^3 -- h3 ----- y
 x^4 -- h4 ----/
 x^5 -- h5 ---/
Not much of a neural net, but you get the point.
Further down the wikipedia page on Taylor series, there are expansions for sin and cos, which are given in terms of odd powers of x and even powers of x respectively (think about it, sin is odd, cos is even, and yes it is that straightforward), so if you supply all the powers of x I would guess that the sin and cos versions will look pretty similar with alternating zero weights. (sin: 0, 1, 0, -1/6..., cos: 1, 0, -1/2...)
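A quick way to check that claim, independent of any neural net, is an ordinary polynomial least-squares fit of sin and cos on [-pi, pi] and a look at the fitted coefficients. A numpy sketch follows; the fitted values differ slightly from the Taylor coefficients because this is a least-squares fit, but the alternating-zero pattern is exact by symmetry.

import numpy as np

x = np.linspace(-np.pi, np.pi, 1000)
sin_coeffs = np.polyfit(x, np.sin(x), 7)[::-1]   # reversed so that index = power of x
cos_coeffs = np.polyfit(x, np.cos(x), 7)[::-1]

print(np.round(sin_coeffs, 3))   # even powers ~0, odd powers roughly 1, -1/6, 1/120, ...
print(np.round(cos_coeffs, 3))   # odd powers ~0, even powers roughly 1, -1/2, 1/24, ...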
I think you can always compute sine and then compute cosine externally. I think your concern here is why the neural net is not learning the cosine function when it can learn the sine function. Assuming that this artifact is not caused by your code, I would suggest the following:
It definitely looks like an error in the learning algorithm. It could be because of your starting point. Try starting with weights that give the correct result for the first input and then march forward.
Check if there is a heavy bias in your learning - more positive than negative.
Since cos(x) = sin(90° - x), you could find the weights for sine and then recompute the weights for cosine in one step.

Generating a pseudo-random positive definite matrix

I wanted to test a simple Cholesky code I wrote in C++. So I am generating a random lower-triangular L and multiplying by its transpose to generate A.
A = L * Lt;
But my code fails to factor A. So I tried this in Matlab:
N=200; L=tril(rand(N, N)); A=L*L'; [lc,p]=chol(A,'lower'); p
This outputs a non-zero p, which means MATLAB also fails to factor A. I am guessing the randomness generates rank-deficient matrices. Am I right?
Update:
I forgot to mention that the following Matlab code seems to work as pointed out by Malife below:
N=200; L=rand(N, N); A=L*L'; [lc,p]=chol(A,'lower'); p
The difference is L is lower-triangular in the first code and not the second one. Why should that matter?
I also tried the following with scipy after reading A simple algorithm for generating positive-semidefinite matrices:
from scipy import random, linalg
A = random.rand(100, 100)
B = A*A.transpose()
linalg.cholesky(B)
But it errors out with:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/scipy/linalg/decomp_cholesky.py", line 66, in cholesky
c, lower = _cholesky(a, lower=lower, overwrite_a=overwrite_a, clean=True)
File "/usr/lib/python2.7/dist-packages/scipy/linalg/decomp_cholesky.py", line 24, in _cholesky
raise LinAlgError("%d-th leading minor not positive definite" % info)
numpy.linalg.linalg.LinAlgError: 2-th leading minor not positive definite
I don't understand why that's happening with scipy. Any ideas?
Thanks,
Nilesh.
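One detail worth checking in the scipy snippet before anything else: rand returns a plain numpy array, and for arrays the * operator is element-wise multiplication, so A*A.transpose() is not the matrix product of A with its transpose. The element-wise product is symmetric but in general not positive definite, which by itself produces exactly this kind of LinAlgError. A sketch of what was presumably intended:

import numpy as np
from scipy import linalg

A = np.random.rand(100, 100)
B = A.dot(A.T)          # matrix product; with plain arrays, A*A.T would be element-wise
linalg.cholesky(B)      # succeeds: B is symmetric positive definite (A is full rank almost surely)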
The problem is not with the Cholesky factorization. The problem is with the random matrix L.
rand(N,N) is much better conditioned than tril(rand(N,N)). To see this, compare cond(rand(N,N)) to cond(tril(rand(N,N))). I got something like 1e3 for the first and 1e19 for the second, so the condition number of the second matrix is much higher and computations will be less stable numerically.
This will result in some small negative eigenvalues in the ill-conditioned case - to see this, look at the eigenvalues using eig(); some small ones will come out negative.
So I would suggest using rand(N,N) to generate a numerically stable random matrix.
BTW if you are interested in the theory of why this happens, you can look at this paper:
http://epubs.siam.org/doi/abs/10.1137/S0895479896312869
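The same comparison in numpy, for anyone not in MATLAB (the exact numbers vary from run to run, but the gap is always enormous):

import numpy as np

N = 200
print(np.linalg.cond(np.random.rand(N, N)))           # typically on the order of 1e3 - 1e4
print(np.linalg.cond(np.tril(np.random.rand(N, N))))  # typically astronomically large (1e16 and up)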
As has been said before, the eigenvalues of a triangular matrix lie on the diagonal. Hence, by doing
L=tril(rand(n))
you made sure that eig(L) only yields positive values. You can improve the condition number of L*L' by adding a large enough positive number to the diagonal, e.g.
L=L+n*eye(n)
and L*L' is positive definite and well conditioned:
> cond(L*L')
ans =
1.8400
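A numpy sketch of that diagonal-shift fix, assuming the same n-by-n setup (the shifted factor is strongly diagonally dominant, so the product is well conditioned and Cholesky goes through):

import numpy as np

n = 200
L = np.tril(np.random.rand(n, n)) + n * np.eye(n)   # lower triangular with a boosted diagonal
A = L @ L.T
np.linalg.cholesky(A)        # succeeds
print(np.linalg.cond(A))     # modest, whereas without the shift L*L' is numerically singular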
To generate a random positive definite matrix in MATLAB your code should read:
N=200;
L=rand(N, N);
A=L*transpose(L);
[lc,p]=chol(A,'lower');
eig(A)
p
And you should indeed have the eigenvalues be greater than zero and p be zero.
You ask about the lower triangular case. Let's see what happens, and why there are problems. It is often a good thing to look at a test case.
For a simple 5x5 matrix,
L = tril(rand(5))
L =
0.72194 0 0 0 0
0.027804 0.78422 0 0 0
0.26607 0.097189 0.77554 0 0
0.96157 0.71437 0.98738 0.66828 0
0.024571 0.046486 0.94515 0.38009 0.087634
eig(L)
ans =
0.087634
0.66828
0.77554
0.78422
0.72194
Of course, the eigenvalues of a triangular matrix are just the diagonal elements. Since the elements generated by rand are always between 0 and 1, on average they will be roughly 1/2. Perhaps looking at the distribution of the determinant of L will help. Better is to consider the distribution of log(det(L)). Since the determinant will be simply the product of the diagonal elements, the log is the sum of the logs of the diagonal elements. (Yes, I know the determinant is a poor measure of singularity, but the distribution of log(det(L)) is easily computed and I'm feeling too lazy to think about the distribution of the condition number.)
Ah, but the negative log of a uniform random variable is an exponential variate, in this case an exponential with lambda = 1. The sum of the logs of a set of n uniform random numbers from the interval (0,1) will, by the central limit theorem, be approximately Gaussian. The mean of that sum will be -n. Therefore the determinant of a lower triangular nxn matrix generated by such a scheme will typically be on the order of exp(-n). When n is 200, MATLAB tells me that
exp(-200)
ans =
1.3839e-87
Thus for a matrix of any appreciable size, we can see that it will be poorly conditioned. Worse, when you form the product L*L', it will generally be numerically singular. The same arguments apply to the condition number. Thus, even for a 20x20 matrix, we see that the condition number of such a lower triangular matrix is fairly large. Then when we form the matrix L*L', the condition number is squared, as expected.
L = tril(rand(20));
cond(L)
ans =
1.9066e+07
cond(L*L')
ans =
3.6325e+14
See how much better things are for a full matrix.
A = rand(20);
cond(A)
ans =
253.74
cond(A*A')
ans =
64384

Rounding a value contained within a CString

So I have a CString which contains a number value e.g. "45.05" and I would like to round this number to one decimal place.
I use this function
_stscanf(strValue, _T("%f"), &m_Value);
to put the value into a float which I can round. However, in the case of 45.05 the number I get is 45.04999..., which rounds to 45.0 where one would expect 45.1.
How can I get the correct value from my CString?
TIA
If you need a string result, your best bet is to find the decimal point and inspect the two digits after it and use them to make a rounded result. If you need a floating-point number as a result, well.. it's hopeless since 45.1 cannot be represented exactly.
EDIT: the nearest you can come to rounding with arithmetic is computing floor(x*10+0.5)/10, but know that doing this with 45.05 WILL NOT and CAN NOT result in 45.1.
You could extract the digits that make up the hundredths and below positions separately, convert them to a number and round it independently, and then add that to the rest of the number:
"45.05" = 45.0 and 0.5 tenths (0.5 can be represented exactly in binary)
round 0.5 tenths to 1
45.0 + 1 tenth = 45.1
Don't confuse this with just handling the fractional part separately. "45.15" isn't divided into 45 and .15, it's divided into 45.1 and 0.5 tenths.
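For what it's worth, a sketch of that digit-inspection idea, written in Python just to keep it short; the function name and the round-half-up convention are my own choices, and the same digit-level logic ports directly to C++/CString. The point is that the decision is made on the decimal digits themselves, so "45.05" never has to survive a trip through binary floating point.

def round_decimal_string(s, places=1):
    """Round a decimal string half-up to places (>= 1) fractional digits."""
    if '.' not in s:
        return s
    sign = '-' if s.startswith('-') else ''
    body = s.lstrip('+-')
    digits = body.replace('.', '')
    keep = body.index('.') + places       # digits kept: the integer part plus places
    digits = digits.ljust(keep, '0')      # pad if the input has fewer fractional digits
    n = int(digits[:keep])
    if keep < len(digits) and digits[keep] >= '5':
        n += 1                            # round half up, based on the first dropped digit
    out = str(n).rjust(places + 1, '0')
    return sign + out[:-places] + '.' + out[-places:]

print(round_decimal_string("45.05"))      # -> "45.1"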
I haven't used C++ in a while, but here are the steps I would take:
Count the digits after the decimal point (call it d)
Remove the decimal point
Cast the string to an int
Round off the last d - 1 digits
Divide by 10 so that one decimal place remains
Store the result in a float

Continuous coloring of fractal

I'm trying to visualize the Mandelbrot set with OpenGL and have found very strange behaviour when it comes to smooth coloring.
Let's assume that, for the current complex value C, the algorithm has escaped after n iterations, when |Z| was shown to be greater than 2.
I programmed the coloring part like this:
if (n == maxIterations) {
    color = 0.0; // 0.0 is black in OpenGL when put into each channel of RGB.
                 // Points in the M-brot set are colored black.
} else {
    color = (n + 1 - log(log(abs(Z))) / log(2.0)) / maxIterations;
    // Continuous coloring algorithm; color is between 0.0 and 1.0.
    // Points outside the M-brot set are colored depending on their absolute value,
    // from brightest near the edge of the set to darkest far away from the set.
}
glColor3f(color, color, color);
// OpenGL command for making an RGB color from three channel values.
The problem is that this just doesn't work well. Some smoothing is noticeable, but it's not perfect.
But when I add two additional iterations (just found this somewhere without explanation)
Z=Z*Z+C;
n++;
in the "else"-branch before calculating a color, the picture comes out to be absolutely, gracefully smooth.
Why does this happen? Why do we need to place additional iterations in coloring part after checking the point to be in set?
I'm not actually certain, but I'm guessing it has something to do with the fact that the log of the log of a number (log(log(n))) is somewhat of an unstable thing for "small" numbers n, where "small" in this case means something close to 2. If your Z has just escaped, it's close to 2. If you continue to iterate, you get (quickly) further and further from 2, and log(log(abs(Z))) stabilizes, thus giving you a more predictable value... which, then, gives you smoother values.
Example data, arbitrarily chosen:
n Z.real Z.imag |Z| status color
-- ----------------- ----------------- ----------- ------- -----
0 -0.74 -0.2 0.766551 bounded [nonsensical]
1 -0.2324 0.096 0.251447 bounded [nonsensical]
2 -0.69520624 -0.2446208 0.736988 bounded [nonsensical]
3 -0.31652761966 0.14012381319 0.346157 bounded [nonsensical]
4 -0.65944494902 -0.28870611409 0.719874 bounded [nonsensical]
5 -0.38848357953 0.18077157738 0.428483 bounded [nonsensical]
6 -0.62175887162 -0.34045357891 0.708867 bounded [nonsensical]
7 -0.46932454495 0.22336006613 0.519765 bounded [nonsensical]
8 -0.56962419064 -0.40965672279 0.701634 bounded [nonsensical]
9 -0.58334691196 0.26670075833 0.641423 bounded [nonsensical]
10 -0.4708356748 -0.51115812757 0.69496 bounded [nonsensical]
11 -0.77959639873 0.28134296385 0.828809 bounded [nonsensical]
12 -0.2113833184 -0.63866792284 0.67274 bounded [nonsensical]
13 -1.1032138084 0.070007489775 1.10543 bounded 0.173185134517425
14 0.47217965836 -0.35446645882 0.590424 bounded [nonsensical]
15 -0.64269284066 -0.53474370285 0.836065 bounded [nonsensical]
16 -0.6128967403 0.48735189882 0.783042 bounded [nonsensical]
17 -0.60186945901 -0.79739278033 0.999041 bounded [nonsensical]
18 -1.0135884004 0.75985272263 1.26678 bounded 0.210802091344997
19 -0.29001471459 -1.7403558114 1.76435 bounded 0.208165835763602
20 -3.6847298156 0.80945758785 3.77259 ESCAPED 0.205910029166315
21 12.182012228 -6.1652650168 13.6533 ESCAPED 0.206137522227716
22 109.65092918 -150.41066764 186.136 ESCAPED 0.20614160700086
23 -10600.782669 -32985.538932 34647.1 ESCAPED 0.20614159039676
24 -975669186.18 699345058.7 1.20042e+09 ESCAPED 0.206141590396481
25 4.6284684972e+17 -1.3646588486e+18 1.44101e+18 ESCAPED 0.206141590396481
26 -1.6480665667e+36 -1.263256098e+36 2.07652e+36 ESCAPED 0.206141590396481
Notice how much the color value is still fluctuating at n in [20,22], getting stable at n=23, and consistent from n=24 onward. And notice that the Z values there are WAY outside the circle of radius 2 that bounds the Mandelbrot set.
I haven't actually done enough of the math to be sure that this is actually a solid explanation, but that would be my guess.
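To make the guess concrete, here is a small sketch (in Python, with my own function name) of the scheme the question describes: iterate until escape, then take a couple of extra steps before applying the log-log formula so that |Z| is far from 2. With extra set to 0 the returned value wobbles from point to point; with extra set to 2 or 3 it tends to settle down, much as in the table above. Don't push extra much higher, though, or |Z| overflows a double.

import math

def smooth_iteration_count(C, max_iter=1000, extra=2, bailout=2.0):
    Z, n = 0j, 0
    while n < max_iter and abs(Z) <= bailout:
        Z = Z * Z + C
        n += 1
    if abs(Z) <= bailout:
        return None            # never escaped: treat as inside the set (colored black)
    for _ in range(extra):     # the extra iterations asked about in the question
        Z = Z * Z + C
        n += 1
    return n + 1 - math.log(math.log(abs(Z))) / math.log(2.0)

mu = smooth_iteration_count(complex(-0.74, -0.2))   # the sample point from the table above
# color = mu / max_iter, as in the question's formula, once mu is not None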
As I have just written the same program, and as a mathematician, I'll explain why; I hope it will be clear.
Let's say the sequence with seed C has escaped after k iterations.
Since the function itself is defined as a limit as the number of iterations goes to infinity, it is not that good to stop at something close to k. Here is why.
As we know, once |z| exceeds 2 the sequence goes to infinity, so let's consider some iteration T where z has become big enough. Compared to it, C will be insignificantly small, since normally you look at the set within [-2, 2] and [-1.5, 1.5] on the two axes. So on iteration T+1, z will be approximately z^2 of the previous iteration, and it is easy to check that |z| at T+1 will then be approximately |z|^2 of the previous one.
Our function is log(|z|)/2^k at the k-th iteration. In the case we are looking at, it is easy to see that at iteration T+1 it will be approximately
log(|z_{T+1}|) / 2^(T+1) ≈ log(|z_T|^2) / 2^(T+1) = log(|z_T|) / 2^T,
which is the value of the function at iteration T.
In other words, as |z| becomes "significantly" bigger than the seed C, the function becomes more and more stable. You do NOT want to use an iteration close to the escape iteration k, since there Z will still be close to 2 and, depending on C, the seed may not be insignificantly small compared to it, so you will not be close to the limit.
When |C| is near 2, on the first escaping iteration you will actually be a LOT further from the limit. If, on the other hand, you use |Z| > 100 as the escape bound, for instance, or just take several more iterations, the value will be much more stable.
I hope this answers the question for anyone interested.
For now it seems that the "additional iterations" idea has no convincing formula behind it, beyond the fact that it just works.
In one old version of Wikipedia's Mandelbrot article there are lines about Continuous Coloring:
Second, it is recommended that a few extra iterations are done so that z can grow. If you stop iterating as soon as z escapes, there is the possibility that the smoothing algorithm will not work.
Better than nothing.