Best criteria for performance evaluation, error or accuracy? - weka

I ran ANN and KNN on the abalone data set using Weka.
Results for ANN:
Correctly Classified Instances     3183    76.203  %
Incorrectly Classified Instances    994    23.797  %
Mean absolute error               0.214
Root mean squared error           0.3349
Relative absolute error           58.6486 %

Results for KNN:
Correctly Classified Instances     3211    76.8734 %
Incorrectly Classified Instances    966    23.1266 %
Mean absolute error               0.2142
Root mean squared error           0.3361
Relative absolute error           58.7113 %
KNN has higher accuracy but ANN has lower errors. So which of the two algorithms should I say is better? Which is the preferable criterion, accuracy or error? My understanding was that error should decrease as accuracy increases, but the results here are the opposite. Why is this so?

The answer depends on whether you want to treat the problem as classification (as suggested by the algorithms you use) or regression. If it's a classification problem, then you should only consider the % of correctly/incorrectly classified instances. Otherwise, the error.
To explain: the % of correctly classified instances only takes into account whether a prediction is correct or not, i.e. predicting 2 instead of 1 is as incorrect as predicting 10000. The reasoning is that you get the class of the datum wrong, and there is no notion of the magnitude of the difference between classes. For regression, on the other hand, you predict a continuous quantity and the magnitude of the difference matters. That is, if the actual value is 1 and the prediction is 2, the model is much better than when the prediction is 10000.
This way you can get better accuracy with worse error, or vice versa. What happens is that you get more correct predictions overall, but the ones that are wrong are further off the mark.
Which measure of performance you want to use really depends on your particular application. Do you simply care whether the correct class is predicted or not, or also about the distance to the correct prediction? If the latter is the case, I would recommend using regression instead of classification models.
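To make the distinction concrete, here is a small, purely illustrative sketch (the numbers are made up, not taken from the abalone runs) showing how one set of predictions can win on accuracy while losing on mean absolute error:

y_true = [1, 2, 3, 4, 5, 6]
pred_a = [1, 2, 3, 4, 6, 7]   # 4/6 correct, but both mistakes are off by only 1
pred_b = [1, 2, 3, 4, 5, 12]  # 5/6 correct, but the single mistake is off by 6

def accuracy(y, p):
    return sum(t == q for t, q in zip(y, p)) / len(y)

def mae(y, p):
    return sum(abs(t - q) for t, q in zip(y, p)) / len(y)

print(accuracy(y_true, pred_a), mae(y_true, pred_a))  # 0.667 0.333
print(accuracy(y_true, pred_b), mae(y_true, pred_b))  # 0.833 1.0

pred_b is "better" by accuracy and "worse" by error, which is exactly the pattern in the ANN/KNN results above.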

Related

DoCPLEX Solving LP Problem Partially at a time

I am working on a linear programming problem with 800K constraints. The full problem takes 20 minutes to solve, but if I solve it for only half the horizon it takes just 1 minute. Is there a way in DOcplex to solve for a partial horizon and then use that solution to solve the other half of the problem, without using a for-loop?
Three suggestions:
load your problem as LP or SAV into the CPLEX interactive optimizer and run "display problem stats". This might show (or rule out) precision issues (an ill-conditioned problem). It will also output the number of nonzeros.
set the datacheck parameter to 2; this might detect numerical issues in the data.
have you tried different LP algorithms? Using the lpmethod parameter you could try the primal, dual or barrier algorithm to see whether one runs faster on your problem.
Reference:
https://www.ibm.com/support/knowledgecenter/SSSA5P_12.10.0/ilog.odms.cplex.help/CPLEX/Parameters/topics/LPMETHOD.html
In DOcplex:
model.parameters.read.datacheck = 2   # datacheck lives under the 'read' parameter group
model.parameters.lpmethod = 4         # 4 = barrier
From your answers, I can think of the following:
if you are in pure LP (is this true?), I see no point in rounding numbers (but yes, that would help in a MIP; try rounding coefficients whose fractional part is, say, less than 1e-7: 4.0000001 -> 4)
a condition number of 1e+14 indicates a serious modeling issue: a common source is blending several objectives into one with weighting coefficients. Have you tried multi-objective optimization to avoid that?
Another source is big-M formulations, for which you should prefer indicator constraints. If you are not in these two cases, then try to rescale the data to keep the condition number in a smaller range...
Finally, you might try setting the Markowitz tolerance to 0.99, to add extra caution in the simplex factorizations, but the behavior may vary from one dataset to another...
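For reference, a minimal DOcplex sketch of two of the points above, an indicator constraint in place of a big-M construct and the Markowitz tolerance; the model, variables and bounds are illustrative placeholders, not taken from the original problem:

from docplex.mp.model import Model

mdl = Model(name="scaling_example")
x = mdl.continuous_var(name="x", ub=100)
b = mdl.binary_var(name="b")

# Indicator constraint instead of big-M: "if b == 1 then x <= 10"
mdl.add_indicator(b, x <= 10, active_value=1)

# Extra caution in simplex factorizations, as suggested above
mdl.parameters.simplex.tolerances.markowitz = 0.99

mdl.maximize(x + 5 * b)
mdl.solve()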

Exploding gradient for gpflow SVGP

When optimizing an SVGP with a Poisson likelihood on a big data set, I see what I think are exploding gradients.
After a few epochs I see a spiky drop in the ELBO, which then recovers very slowly after throwing away all the progress made before.
Roughly 21 iterations correspond to one epoch.
This spike (at least the second one) resulted in a complete shift of the parameters (for parameter vectors I just plotted the norm to see the changes).
How can I deal with this? My first approach would be to clip the gradients, but that seems to require digging around in the GPflow code.
My Setup:
Training works via Natural Gradients for the variational parameters and ADAM for the rest, with a slowly (linearly) increasing schedule for the Natural Gradient Gamma.
The batch and inducing point sizes are as large as possible for my setup
(both 2^12, with the data set consisting of ~88k samples). I include 1e-5 jitter and initialize the inducing points with kmeans.
I use a sum kernel composed of RBF, Matern52, a periodic and a linear kernel on a total of 95 features (many of them due to a one-hot encoding), all learnable.
The lengthscales are transformed with gpflow.transforms.
import numpy as np
import gpflow
from gpflow.kernels import Matern52, Periodic, Linear, RBF

# kernel_idxs, X, Y, Z, MB_SIZE and NAME come from my data preparation (not shown).
with gpflow.defer_build():
    k1 = Matern52(input_dim=len(kernel_idxs["coords"]), active_dims=kernel_idxs["coords"], ARD=False)
    k2 = Periodic(input_dim=len(kernel_idxs["wday"]), active_dims=kernel_idxs["wday"])
    k3 = Linear(input_dim=len(kernel_idxs["onehot"]), active_dims=kernel_idxs["onehot"], ARD=True)
    k4 = RBF(input_dim=len(kernel_idxs["rest"]), active_dims=kernel_idxs["rest"], ARD=True)

    k1.lengthscales.transform = gpflow.transforms.Exp()
    k2.lengthscales.transform = gpflow.transforms.Exp()
    k3.variance.transform = gpflow.transforms.Exp()
    k4.lengthscales.transform = gpflow.transforms.Exp()

    m = gpflow.models.SVGP(X, Y, k1 + k2 + k3 + k4, gpflow.likelihoods.Poisson(), Z,
                           mean_function=gpflow.mean_functions.Constant(c=np.ones(1)),
                           minibatch_size=MB_SIZE, name=NAME)
    m.mean_function.set_trainable(False)

m.compile()
UPDATE: Using only ADAM
Following the suggestion by Mark, I switched to Adam only, which helped me get rid of that sudden explosion. However, I still initialize with one epoch of natural gradients only, which seems to save a lot of time.
In addition, the variational parameters seem to change much less abruptly (in terms of their norm, at least). I guess they'll converge much more slowly now, but at least it's stable.
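For anyone reproducing this, a rough sketch of that setup in the GPflow 1.x API (the gamma, learning rate and iteration counts below are placeholders, not the values I actually used):

# One natural-gradient pass over the variational parameters to initialise them,
# then plain Adam for all parameters.
var_list = [(m.q_mu, m.q_sqrt)]
natgrad = gpflow.train.NatGradOptimizer(gamma=0.01)
adam = gpflow.train.AdamOptimizer(0.001)

natgrad.minimize(m, var_list=var_list, maxiter=21)   # roughly one epoch of natgrad only
adam.minimize(m, maxiter=10000)                      # then Adam only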
Just to add to Mark's answer above, when using nat grads in non-conjugate models it can take a bit of tuning to get the best performance, and instability is potentially a problem. As Mark points out, the large steps that provide potentially faster convergence can also lead to the parameters ending up in bad regions of the parameter space. When the variational approximation is good (i.e. the true and approximate posterior are close) then there is good reason to expect that the nat grad will perform well, but unfortunately there is no silver bullet in the general case. See https://arxiv.org/abs/1903.02984 for some intuition.
This is very interesting. Perhaps trying not to use natgrads is a good idea as well. Clipping gradients indeed seems like a hack that could work. And yes, this would require digging around in the GPflow code a bit. One tip that can help here is to not use the GPflow optimisers directly. The model._likelihood_tensor contains the TF tensor that should be optimised. Perhaps with some manual TensorFlow magic, you can do the gradient clipping there before running an optimiser.
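A rough sketch of what that manual TensorFlow route could look like (GPflow 1.x / TensorFlow 1.x; the clip norm, learning rate and the use of trainable_tensors / enquire_session are assumptions, so treat this as a starting point rather than a tested recipe):

import tensorflow as tf

objective = -m._likelihood_tensor                          # GPflow minimises the negative ELBO
opt = tf.train.AdamOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(objective, var_list=m.trainable_tensors)
grads, variables = zip(*grads_and_vars)
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)  # cap the global gradient norm
train_op = opt.apply_gradients(list(zip(clipped, variables)))

sess = m.enquire_session()
sess.run(tf.variables_initializer(opt.variables()))        # initialise Adam's slot variables
for _ in range(1000):
    sess.run(train_op)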
In general, I think this sounds like you've stumbled on an actual research problem. Usually these large gradients have a good reason in the model, which can be addressed with careful thought. Is it variance in some monte carlo estimate? Is the objective function behaving badly?
Regarding why not using natural gradients helps. Natural gradients use the Fisher matrix as a preconditioner to perform second order optimisation. Doing so can result in quite aggressive moves in parameter space. In certain cases (when there are usable conjugacy relations) these aggressive moves can make optimisation much faster. This case, with the Poisson likelihood, is not one where there are conjugacy relations that will necessarily help optimisation. In fact, the Fisher preconditioner can often be detrimental, particularly when variational parameters are not near the optimum.
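For reference, the update being described is the generic natural-gradient step (standard notation, not GPflow-specific), where $F$ is the Fisher information of the variational distribution and $\gamma$ the step size:

$$\theta_{t+1} = \theta_t - \gamma\, F(\theta_t)^{-1} \nabla_\theta \mathcal{L}(\theta_t)$$

With a poorly suited preconditioner $F^{-1}$, a single step can move the parameters much further than a plain gradient step would.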

Why does skipgram model take more time than CBOW

Why does the skip-gram model take more time to train than the CBOW model? I train both models with the same parameters (vector size and window size).
The skip-gram approach involves more calculations.
Specifically, consider a single 'target word' with a context-window of 4 words on either side.
In CBOW, the vectors for all 8 nearby words are averaged together, then used as the input for the algorithm's prediction neural-network. The network is run forward, and its success at predicting the target word is checked. Then back-propagation occurs: all neural-network connection values – including the 8 contributing word-vectors – are nudged to make the prediction slightly better.
Note, though, that the 8-word-window and one-target-word only require one forward-propagation, and one-backward-propagation – and the initial averaging-of-8-values and final distribution-of-error-correction-over-8-vectors are each relatively quick/simple operations.
Now consider instead skip-gram. Each of the 8 context-window words is in turn individually provided as input to the neural network, forward-checked for how well the target word is predicted, then backward-corrected. Though the averaging/splitting is not done, there are 8 times as many neural-network operations. Hence, much more net computation and more run-time.
Note the extra effort/time may pay itself back by improving vector quality on your final evaluations. Whether and to what extent depends on your specific goals and corpus.
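If you want to see the difference directly, here is a small timing sketch using gensim (keyword names follow gensim 4.x, where the argument is vector_size rather than the older size; the toy corpus and values are placeholders):

import time
from gensim.models import Word2Vec

corpus = [["this", "is", "a", "tiny", "example", "sentence"]] * 10000  # stand-in corpus

for sg, name in [(0, "CBOW"), (1, "skip-gram")]:
    start = time.time()
    Word2Vec(corpus, vector_size=100, window=4, sg=sg, epochs=5, workers=1)
    print(name, round(time.time() - start, 1), "seconds")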

Another way to calculate double type variables in c++?

Short version of the question: overflow or timeout with the current settings when calculating with large int64_t and double values; is there any way to avoid these?
Test case:
If the only demand is 80,000,000,000, the problem is solved with the correct result. But if it's 800,000,000,000, an incorrect 0 is returned.
If the input has two or more demands (meaning more inequalities need to be calculated), smaller values also cause incorrect results; e.g., three equal demands of 20,000,000,000 trigger the problem.
I'm using the COIN-OR CLP linear programming solver to solve some network flow problems. I use int64_t to represent the link bandwidth, but CLP uses double most of the time and cannot easily be switched to other types.
When the values of the variables are not that large (typically smaller than 10,000,000,000) and the constraints (inequalities) are relatively few, it gives the solution I want. But if either of these factors increases, the tool stops and returns a 0-value solution. I think the reason is that the magnitude of the calculation exceeds what the solver can handle, so the program breaks at some point (it uses the LP simplex method).
The inequality is of the form:
totalFlowSum <= usePercentage * demand
I changed it to
totalFlowSum - usePercentage * demand <= 0
totalFlowSum and demand are very large int64_t values and usePercentage is a double. If there are too many constraints like this (several or more), or if the demand is larger than 100,000,000,000, the returned solution is wrong.
Is there any way to correct this, such as increasing some threshold or avoiding calculations of this magnitude?
Losing some accuracy is acceptable. One possible solution is to scale the inputs down by 1,000 and scale the outputs back up by 1,000, but this is kind of naive and may require too many code changes in the program.
Update:
I have changed the formulation to
totalFlowSum / demand - usePercentage <= 0
but the problem still exists.
Update 2:
I divided usePercentage by 1000, changing its coefficient from 1 to 0.001, and it worked. But if I also divide totalFlowSum/demand by 1000 at the same time, there is still no result. I don't know why...
I changed the rhs of the inequalities from 0 to 0.1, and the problem is then solved! Since the inputs are very large, a 0.1 offset won't impact the solution at all.
I think the reason is that the previous coefficients were badly scaled, so the solver failed to find an exact answer.
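To put the scaling point in numbers, here is a small illustrative sketch (the 1e-7 tolerance is a typical solver default used for illustration, not CLP's exact setting):

# A double carries roughly 15-16 significant decimal digits, so round-off on
# coefficients of size ~1e11 is already around 1e-5, which swamps a 1e-7
# feasibility tolerance. Rescaling the data keeps the tolerance meaningful.
feasibility_tol = 1e-7

raw_demand = 800_000_000_000          # bandwidth in raw units
scaled_demand = raw_demand / 1e9      # the same quantity expressed in billions of units

print(raw_demand * 1e-16)             # ~8e-05: round-off at this magnitude exceeds the tolerance
print(scaled_demand * 1e-16)          # ~8e-14: far below the tolerance after rescaling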

Algorithm analysis - Expected growth rates

I have a question about my homework; I need some clarification.
http://www.cs.bilkent.edu.tr/~gunduz/teaching/cs201/cs201_homework3.pdf
To see the handout please go to page 25 of http://www.scribd.com/nanny24/d/36657378-Data-Structures-and-Algorithm-Analysis-in-C-Weiss.
The following is what I need to do, but I don't understand what it means. Does it mean that, for algorithm 1, I should compare the actual running time against (n^3 + 3*n^2 + 2*n)/6, where n is the array size?
I don't think so, but I couldn't infer anything else. Can you please explain to me what this means?
2- Plot the expected growth rates obtained from the theoretical analysis (as given for each solution) by
using the same N values that you used in obtaining your results. Compare the expected growth rates
and the obtained results, and discuss your observations in a paragraph.
EDIT 2:
Algorithm 1:
n       actual running time (ms)    (n^3 + 3*n^2 + 2*n)/6
100     1                           171700
1000    851                         167167000
(I don't know whether the theoretical value is in milliseconds or not.)
So, considering this huge difference between the actual running time and the theoretical value, what the instructor means may be something different from the theoretical time complexity function, which is (n^3 + 3*n^2 + 2*n)/6 for algorithm 1. This is the function: http://www.diigo.com/item/image/2lxmz/m7y3?size=o
Yes, your instructor means by "expected growth rate" the predicted running time after you plug in the value of n in the theoretical time complexity function.
While this usage is standard, I would still check with the instructor if I were you.
The theoretical number is probably the number of operations or comparisons or something similar.
I guess that "growth rate" means how fast the value grows. When n goes from 100 to 1000, the theoretical value grows by a factor of 167167000/171700 ≈ 973.6, compared to the real-world measured factor of 851.
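A quick sketch of that comparison, using only the numbers already given above (note that the 1 ms measurement at n = 100 is too coarse to draw strong conclusions from):

def theoretical(n):
    # operation count stated for algorithm 1
    return (n**3 + 3 * n**2 + 2 * n) // 6

for n in (100, 1000):
    print(n, theoretical(n))                   # 171700 and 167167000

print(theoretical(1000) / theoretical(100))    # ~973.6, expected growth factor
print(851 / 1)                                 # measured growth factor from the timings above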