How to understand the choice of learning rate in SGD (gradient descent)?

Like most numerical hyper-parameters,
the learning rate should be explored in the log-domain, and there is not much to be gained
by refining it more than a factor of 2, whereas the dynamic range explored could be around
10^6; learning rates are typically below 1.
In this paragraph, I want to know:
1. What is the meaning of log-domain?
2. What is the use of the factor of 2?
3. What is the meaning of dynamic range?
Thanks!

It's probably talking about how to scale learning rates:
1: http://en.wikipedia.org/wiki/Logarithmic_scale
2: http://en.wikipedia.org/wiki/Scale_factor
3: http://en.wikipedia.org/wiki/Dynamic_range
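To make the log-domain idea concrete, here is a minimal Python sketch that generates candidate learning rates spanning a dynamic range of about 10^6 in factor-of-2 steps (the endpoints 1e-6 and 1.0 are my own assumptions, chosen to match the quoted paragraph):

import numpy as np

# Candidate learning rates on a log scale: factor-of-2 steps across a
# dynamic range of ~10^6, staying below 1 (endpoints are assumptions).
lo, hi = 1e-6, 1.0
n = int(np.log2(hi / lo)) + 1          # one candidate per factor of 2
candidates = lo * 2.0 ** np.arange(n)  # 1e-6, 2e-6, 4e-6, ..., ~0.5
print(candidates)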

What's the difference between the "optimal" solution and the "optimalTol" solution of CPLEX?

1. When I solve an integer programming model with the CPLEX solver, the result status of some instances shows as "optimal"; however, the result status of other instances shows as "optimalTol". I want to know the difference between the optimal solution and the optimalTol solution of CPLEX.
2. My integer programming model minimizes the objective. When I solve it with the CPLEX solver, the result status is "optimalTol" and the objective value is, for example, 1000. When I add cplex.setParam(IloCplex::EpGap,0.0) to the solver and solve the model again, will the value of the objective function become larger or smaller?
This is relevant for any MIP solver.
A solver can stop for different reasons:
time limit or some other limit (iteration, node limit, etc.)
the gap tolerance has been met
the solution is proven optimal
The default gap tolerances are not zero, but rather something like 1e-4 for the relative gap and 1e-6 for the absolute gap. That means that CPLEX can stop while not being 100% sure that there are no better solutions. However, the gap tolerance bounds how much better the best possible solution can be: e.g. if the relative gap tolerance is 1% and the absolute gap tolerance is 0, then the best possible solution cannot be more than 1% away.
If you set both to zero, CPLEX will need more time but will always deliver OPTIMAL (unless you hit a limit). If you allow a small gap, CPLEX will most likely return OPTIMAL_TOL ("we stopped because we met the gap tolerance"), but in some cases it can still be sure there is no better solution, in which case it returns OPTIMAL. As for your second question: since your model is a minimization, setting EpGap to 0 can only leave the objective value unchanged or make it smaller (better), never larger. For large, practical models we are often happy with a solution that is within, say, 5% of the best possible.
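As a sketch of how to set this programmatically (here in the CPLEX Python API rather than the C++ API from the question; parameters.mip.tolerances.mipgap is the Python counterpart of IloCplex::EpGap, and the model file name is hypothetical):

import cplex

cpx = cplex.Cplex("model.lp")                     # hypothetical model file
cpx.parameters.mip.tolerances.mipgap.set(0.0)     # relative gap, default ~1e-4
cpx.parameters.mip.tolerances.absmipgap.set(0.0)  # absolute gap, default ~1e-6
cpx.solve()
print(cpx.solution.get_status_string())           # status should now be proven
                                                  # optimal, not tolerance-based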

Donald Knuth's algorithm for Mastermind - can we do better?

I implemented Donald Knuth's 1977 algorithm for Mastermind: https://www.cs.uni.edu/~wallingf/teaching/cs3530/resources/knuth-mastermind.pdf
I was able to reproduce his results: 5 guesses to win in the worst case and 4.476 on average.
Then I tried something different. I ran Knuth's algorithm repeatedly, shuffling the entire list of combinations randomly each time before starting. I was able to land on a strategy with 5 guesses to win in the worst case (like Knuth) but with 4.451 guesses to win on average. Better than Knuth.
Is there any previous work trying to outperform Knuth's algorithm on average, while maintaining the worst case? I could not find any indication of it on the web so far.
Thanks!
Alon
In the paper, Knuth describes how the strategy was chosen:
Table 1 was found by choosing at every stage a test pattern that minimizes the maximum number of remaining possibilities, over all conceivable responses by the codemaker. If this minimum can be achieved by a “valid” pattern (a pattern that makes “four black hits” possible), a valid one should be used. Subject to this condition, the first such test pattern in numeric order was selected. Fortunately this procedure turns out to guarantee a win in five moves.
So it is to some extent a greedy strategy (trying to make the most progress at each step, rather than overall), and moreover there's an ad-hoc tie-breaking strategy. This means that it need not be optimal in expected value, and indeed Knuth says exactly that:
The strategy in Table 1 isn’t optimal from the “expected number of moves” standpoint, but it is probably very close. One line that can be improved [...]
So already at the time the paper was published, Knuth was aware that it's not optimal and even had an explicit example.
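For concreteness, here is a minimal Python sketch of the greedy minimax step described in that quote (the representation and helper names are my own):

from itertools import product
from collections import Counter

ALL_CODES = list(product(range(1, 7), repeat=4))  # 6^4 = 1296 codewords

def feedback(guess, code):
    # Black hits: right color, right position; whites: right color only.
    black = sum(g == c for g, c in zip(guess, code))
    common = sum(min(guess.count(p), code.count(p)) for p in set(guess))
    return black, common - black

def knuth_step(possible):
    # For each candidate guess, the worst case is the largest group of
    # still-possible codes sharing one feedback. Minimize that; prefer
    # "valid" guesses (those that could be the code), then numeric order.
    def worst(guess):
        return max(Counter(feedback(guess, code) for code in possible).values())
    return min(ALL_CODES, key=lambda g: (worst(g), g not in possible))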
When this paper was republished in his collection Selected Papers on Fun and Games (2010), he adds a 5-page addendum to the 6-page paper. In this addendum, he starts by mentioning randomization in the very first paragraph, and discusses the question of minimizing the expected number of moves. Analyzing it as the sum of all moves made over all 1296 possible codewords, he mentions a few papers:
His original algorithm gave 5801 (average of 5801/1296 ≈ 4.47608), and the minor improvement gives 5800 (≈ 4.4753).
Robert W. Irving, “Towards an optimum Mastermind strategy,” Journal of Recreational Mathematics 11 (1978), 81-87 [while staying within the “at most 5” achieves 5664 ⇒ ≈4.37]
E. Neuwirth, “Some strategies for Mastermind,” Zeitschrift für Operations Research 26 (1982), B257-B278 [achieves 5658 ⇒ ≈4.3657]
Kenji Koyama and Tony W. Lai, “An optimal Mastermind strategy,” Journal of Recreational Mathematics 25 (1993), 251-256 [achieves 5626 ⇒ ≈4.34104938]
The last of these is the best possible, as it was found with an exhaustive depth-first search. (Note that all of these papers can do slightly better in the expected number of moves, if you allow them to take 6 moves sometimes... I gave the numbers with the “at most 5” constraint because that's what the question here asks for.)
You can make this more general (and harder) by assuming the codemaker is adversarial and does not choose uniformly at random among the 1296 possible codewords, but according to whatever distribution makes it hardest for the codebreaker. Finally, he mentions a lot of work done by Tom Nestor, which conclusively settles many such questions.
You might have fun trying to follow up or reproduce these results (e.g. write the exhaustive search program). Enjoy!
As far as I know, there is no published work about this effect yet. I made this observation some time ago: one can get better results by not always choosing the (canonically) first trial out of the "one-step-lookahead" set. I observed different results by not starting with 1122 but with, e.g., 5544. One can also choose randomly instead of using the canonically first. Yes, I agree with you that this is an interesting point - but a very, very special one.

ML.NET - Multiclass Classification score values

I currently have a project to take large bits of text and classify them as types. This is similar to the sentiment sample provided by Microsoft, except it's multiclass instead of binary.
I have the code working just fine, and it will likely become stronger as we add data to it. However, I have hit a snag where I am unable to determine whether the prediction just straight-up doesn't know what to choose. For my project it is much more valuable to not know the answer than to get it wrong. I am not sure if that is even a thing in ML.NET. I was looking through the documentation and the only thing I could find was the score value produced by the prediction. The problem is that I don't know what any of the score values mean. I know they are broken out per class, but the numeric values differ between algorithms. Does anyone have any insight on these values? Or any advice on the "don't know" vs "guessing" issue?
Appreciate your time, thanks.
The scores are largely learner-specific; the only requirement is that they are monotonic (a higher score means a higher likelihood of the example belonging to that class).
But in ML.NET multiclass learners, they are always between 0 and 1 and sum up to 1. You can think of the scores as 'predicted probabilities of belonging to that class'.
Now to the question of how to take confidence into account. For a binary classification problem, I would have a standard recommendation: plot a precision-recall curve, and then instead of choosing one threshold on the score, choose two: one that gives a high-precision (potentially low-recall) positive, and another that gives a high-precision (potentially low-recall) negative.
So:
if (score > threshold1)
    return "positive";
else if (score < threshold2)
    return "negative";
else
    return "don't know";
For the multiclass case, you can employ the same procedure independently for each class. This way, you will have a per-class 'yes-no-maybe' answer.
You will have to deal with the potential for multiple 'yes' answers, or other kinds of conflicts, with this approach, but at least it gives an idea.
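A minimal Python sketch of that per-class thresholding (the labels and threshold values are hypothetical; in practice you would read the thresholds off one-vs-rest precision-recall curves on a validation set):

# Hypothetical per-class "yes" thresholds, e.g. read off one-vs-rest
# precision-recall curves on a validation set.
YES_THRESHOLD = {"sports": 0.85, "politics": 0.90, "tech": 0.80}

def decide(scores):
    # scores: class label -> predicted probability (the scores sum to 1).
    confident = [label for label, p in scores.items()
                 if p > YES_THRESHOLD[label]]
    if len(confident) == 1:   # exactly one confident class
        return confident[0]
    return "don't know"       # zero confident classes, or a conflict

print(decide({"sports": 0.91, "politics": 0.05, "tech": 0.04}))  # sports
print(decide({"sports": 0.40, "politics": 0.35, "tech": 0.25}))  # don't know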

How many samples are optimal in one class using k-nearest neighbor?

I have implemented the k-nearest-neighbor algorithm in my system. My dataset consists of 26 classes, each with 100 samples. In my case K=7, and it was completely trial and error to get the best classification result.
I know that K should be chosen wisely to reduce noise in the classification. But what about the number of samples? Is there any general rule, such as "the more samples, the better the result"? Does it depend on something?
Thank you for all your responses.
You could try considering whatever underlying mechanism is generating your data, or whatever background knowledge you have on the problem, which might give you an idea of the relative size of the noise and the true underlying variation. E.g., predicting favourite sports team from location I would expect to vary more than predicting favourite sport, so I would use a smaller k. However, I don't know of much general guidance, except to use cross-validation.
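As a minimal sketch of that cross-validation approach in Python with scikit-learn (the data here is a random placeholder matching your 26 classes x 100 samples; swap in your real features and labels):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(2600, 10)        # placeholder: 26 classes x 100 samples
y = np.repeat(np.arange(26), 100)   # placeholder labels

best_k = max(range(1, 16, 2),       # odd k values avoid some voting ties
             key=lambda k: cross_val_score(
                 KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean())
print(best_k)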

Appropriate minimum support for itemsets?

Please suggest any material about appropriate minimum support and confidence values for itemsets!
I use the Apriori algorithm to search for frequent itemsets, but I still don't know the appropriate support and confidence for them. I wish to know what kinds of considerations go into deciding how big the support should be.
The answer is that the appropriate values depend on the data: for some datasets the best value may be 0.5, while for others it may be 0.05.
But if you set minsup = 0 and minconf = 0, some algorithms will run out of memory before terminating, or you may run out of disk space because there are too many patterns.
From my experience, the best way to choose minsup and minconf is to start with high values and then lower them gradually until you find enough patterns, as in the sketch below.
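Here is a minimal sketch of that strategy in Python using mlxtend's apriori (the library choice, the toy transactions, and the target count are my own assumptions):

import pandas as pd
from mlxtend.frequent_patterns import apriori

# Toy one-hot encoded transactions (rows = transactions, columns = items).
df = pd.DataFrame({"bread": [1, 1, 0, 1],
                   "milk":  [1, 0, 1, 1],
                   "eggs":  [0, 1, 1, 0]}).astype(bool)

minsup, wanted = 0.9, 5               # start high, aim for >= 5 itemsets
itemsets = apriori(df, min_support=minsup, use_colnames=True)
while len(itemsets) < wanted and minsup > 0.05:
    minsup /= 2                       # lower the threshold gradually
    itemsets = apriori(df, min_support=minsup, use_colnames=True)
print(itemsets)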
Alternatively, if you don't want to set minsup at all, you can use a top-k algorithm, where instead of specifying minsup you specify that you want, for example, the k = 1000 most frequent rules.
If you are interested by top-k association rule mining, you can check my Java code here:
http://www.philippe-fournier-viger.com/spmf/
The algorithm is called TopKRules and the article describing it will be published next month.
Besides that, you should know that there are many other interestingness measures besides support and confidence: lift, all-confidence, ... To learn more about this, you can read the articles "On selecting interestingness measures for association rules" and "A Survey of Interestingness Measures for Association Rules". Basically, all measures have problems in some cases... no measure is perfect.
Hope this helps!
In any association rule mining algorithm, including Apriori, it is up to the user to decide what support and confidence values to provide. Depending on your dataset and your objectives, you decide the minSup and minConf.
Obviously, if you set these values lower, your algorithm will take longer to execute and you will get a lot more results.
The minimum support and minimum confidence parameters are a user preference. If you want a larger quantity of results (with lower statistical confidence), choose the parameters accordingly. In theory you can set them to 0; the algorithm will run, but it will take a long time, and the result will not be particularly useful, as it contains just about anything.
So choose them so that the results suit your needs. Mathematically, any value is "correct".