How to use shuffle in KFold in scikit_learn

How to use shuffle in KFold in scikit_learn - python-2.7

I am running 10-fold CV using the KFold function provided by scikit-learn in order to select some kernel parameters. I am implementing this (grid_search)procedure:
1-pick up a selection of parameters
2-generate a svm
3-generate a KFold
4-get the data that correspons to training/cv_test
5-train the model (clf.fit)
6-classify with the cv_testdata
7-calculate the cv-error
8-repeat 1-7
9-When ready pick the parameters that provide the lowest average(cv-error)
If I do not use shuffle in the KFold generation, I get very much the same results for the average( cv_errors) if I repeat the same runs and the "best results" are repeatable.
If I use the shuffle, I am getting different values for the average (cv-errors) if I repeat the same run several times and the "best values" are not repeatable.
I can understand that I should get different cv_errors for each KFold pass but the final average should be the same.
How does the KFold with shuffle really work?
Each time the KFold is called, it shuffles my indexes and it generates training/test data. How does it pick the different folds for "training/testing"? Does it have a random way to pick the different folds for training/testing?
Any situations where its avantageous with "shuffle" and situations that are not??

If shuffle is True, the whole data is first shuffled and then split into the K-Folds. For repeatable behavior, you can set the random_state, for example to an integer seed (random_state=0).
If your parameters depend on the shuffling, this means your parameter selection is very unstable. Probably you have very little training data or you use to little folds (like 2 or 3).
The "shuffle" is mainly useful if your data is somehow sorted by classes, because then each fold might contain only samples from one class (in particular for stochastic gradient decent classifiers sorted classes are dangerous).
For other classifiers, it should make no differences. If shuffling is very unstable, your parameter selection is likely to be uninformative (aka garbage).

Related

How to choose the right number of dimension in UMAP?

I wanna try to use UMAP for my high-dimensional dataset as a preprocessing step (not for data visualization) in order to decrease the number of features, but how can I choose (if there is a method) the right number of dimensions in which to map the original data? For example, in PCA you can select the number of Factors that explain a fixed % of variances.

There is no good way to do this comparable to the explicit measure given by PCA. As a rule of thumb, however, you will get significantly diminishing returns for an embedding dimension larger than the n_neighbors value. With that in mind, and since you actually have a downstream task, it makes the most sense to build a pipeline to the downstream task evaluation and look at cross validation over the number of UMAP dimensions.

Comparing different implementations with random seeds

I have two implementations of a program, one using lists and one using vectors, in order to compare their runtimes. The class functions in each implementation are different, since the list implementation allows more flexibility in code. They both also use random number generators.
I set both to have random seed 0 and ran them, but the results I get are not the same.
One question I have is, if both implementations call a function using a random seed, e.g.
boost::variate_generator<boost::mt19937&, boost::exponential_distribution<>> random_n(seed, boost::exponential_distribution<>()) ;
and one calls it more times than the other implementation, will that cause desynchronization with respect to random seeds?
To be more specific, the vector implementation simulates a Poisson Process on a continuous real segment, e.g. [0,1], whereas the list implementation simulates the PP on separate partitions: {[0,0.1], [0.1,0.2], [0.2,0.3], ..., [0.9, 1]} and then combines the results. Simulating a PP on the big partition could mean as few as 1 boost::exponential_distribution calls, but simulating on the 10 partitions requires at least 10 boost::exponential_distribution calls, even if none of them may be used (e.g. if they overshoot the partition).
Even though probabilistically, these methods should generate the same kind of results, would the seeds between the programs be de-synchronized? And if so, is there any way to resynchronize them without changing the implementation?

Query re. how to set up an SVM, which SVM variation … and how to define a metric

I’d like to learn how best set up an SVM in openCV (or other C++ library) for my particular problem (or if indeed there is a more appropriate algorithm).
My goal is to receive a weighting of how well an input set of labeled points on a 2D plane compares or fits with a set of ‘ideal’ sets of labeled 2D points.
I hope my illustrations make this clear – the first three boxes labeled A through C, indicate different ideal placements of 3 points, in my illustrations the labelling is managed by colour:
The second graphic gives examples of possible inputs:
If I then pass for instance example input set 1 to the algorithm it will compare that input set with each ideal set, illustrated here:
I would suggest that most observers would agree that the example input 1 is most similar to ideal set A, then B, then C.
My problem is to get not only this ordering out of an algorithm, but also ideally a weighting of by how much proportion is the input like A with respect to B and C.
For the example given it might be something like:
A:60%, B:30%, C:10%
Example input 3 might yield something such as:
A:33%, B:32%, C:35% (i.e. different order, and a less 'determined' result)
My end goal is to interpolate between the ideal settings using these weights.
To get the ordering I’m guessing the ‘cost’ involved of fitting the inputs to each set maybe have simply been compared anyway (?) … if so, could this cost be used to find the weighting? or maybe was it non-linear and some kind of transformation needs to happen? (but still obviously, relative comparisons were ok to determine the order).
Am I on track?
Direct question>> is the openCV SVM appropriate? - or more specifically:
A series of separated binary SVM classifiers for each ideal state and then a final ordering somehow ? (i.e. what is the metric?)
A version of an SVM such as multiclass, structured and so on from another library? (...that I still find hard to conceptually grasp as the examples seem so unrelated)
Also another critical component I’m not fully grasping yet is how to define what determines a good fit between any example input set and an ideal set. I was thinking Euclidian distance, and I simply sum the distances? What about outliers? My vector calc needs a brush up, but maybe dot products could nose in there somewhere?
Direct question>> How best to define a metric that describes a fit in this case?
The real case would have 10~20 points per set, and time permitting as many 'ideal' sets of points as possible, lets go with 30 for now. Could I expect to get away with ~2ms per iteration on a reasonable machine? (macbook pro) or does this kind of thing blow up ?
(disclaimer, I have asked this question more generally on Cross Validated, but there isn't much activity there (?))

Hot to get absolute values instead of percentages in Perf’s annotate view?

Question
In Perf’s annotate view, the runtime consumed by each instruction is given on the left side as a percentage. Is there some option to have some absolute quantity (probably samples) displayed instead?
Background
I am using some C code with classical C arrays as well as with Numpy Arrays (for use in a Python module) and I want to compare the performances by having it run on some identical example cases. There are certain parts (e.g., initialisation) whose performance I know to differ and which I am not interested in. However, these affect the total runtime and thus render the percentage values for the other parts uncomparable – unless I want to transform the values myself. If I could access the total runtimes, I could easily compare the different variants piece by piece.

Are infinite lists useful for any real world applications?

I've been using haskell for quite a while now, and I've read most of Real World Haskell and Learn You a Haskell. What I want to know is whether there is a point to a language using lazy evaluation, in particular the "advantage" of having infinite lists, is there a task which infinite lists make very easy, or even a task that is only possible with infinite lists?

Here's an utterly trivial but actually day-to-day useful example of where infinite lists specifically come in handy: When you have a list of items that you want to use to initialize some key-value-style data structure, starting with consecutive keys. So, say you have a list of strings and you want to put them into an IntMap counting from 0. Without lazy infinite lists, you'd do something like walk down the input list, keeping a running "next index" counter and building up the IntMap as you go.
With infinite lazy lists, the list itself takes the role of the running counter; just use zip [0..] with your list of items to assign the indices, then IntMap.fromList to construct the final result.
Sure, it's essentially the same thing in both cases. But having lazy infinite lists lets you express the concept much more directly without having to worry about details like the length of the input list or keeping track of an extra counter.

An obvious example is chaining your data processing from input to whatever you want to do with it. E.g., reading a stream of characters into a lazy list, which is processed by a lexer, also producing a lazy list of tokens which are parsed into a lazy AST structure, then compiled and executed. It's like using Unix pipes.

I found it's often easier and cleaner to just define all of a sequence in one place, even if it's infinite, and have the code that uses it just grab what it wants.
take 10 mySequence
takeWhile (<100) mySequence
instead of having numerous similar but not quite the same functions that generate a subset
first10ofMySequence
elementsUnder100ofMySequence
The benefits are greater when different subsections of the same sequence are used in different areas.

Infinite data structures (including lists) give a huge boost to modularity and hence reusability, as explained & illustrated in John Hughes's classic paper Why Functional Programming Matters.
For instance, you can decompose complex code chunks into producer/filter/consumer pieces, each of which is potentially useful elsewhere.
So wherever you see real-world value in code reuse, you'll have an answer to your question.

Basically, lazy lists allow you to delay computation until you need it. This can prove useful when you don't know in advance when to stop, and what to precompute.
A standard example is u_n a sequence of numerical computations converging to some limit. You can ask for the first term such that |u_n - u_{n-1}| < epsilon, the right number of terms is computed for you.
Now, you have two such sequences u_n and v_n, and you want to know the sum of the limits to epsilon accuracy. The algorithm is:
compute u_n until epsilon/2 accuracy
compute v_n until epsilon/2 accuracy
return u_n + v_n
All is done lazily, only the necessary u_n and v_n are computed. You may want less simple examples, eg. computing f(u_n) where you know (ie. know how to compute) f's modulus of continuity.

Sound synthesis - see this paper by Jerzy Karczmarczuk:
http://users.info.unicaen.fr/~karczma/arpap/cleasyn.pdf
Jerzy Karczmarcuk has a number of other papers using infinite lists to model mathematical objects like power series and derivatives.
I've translated the basic sound synthesis code to Haskell - enough for a sine wave unit generator and WAV file IO. The performance was just about adequate to run with GHCi on a 1.5GHz Athalon - as I just wanted to test the concept I never got round to optimizing it.

Infinite/lazy structures permit the idiom of "tying the knot": http://www.haskell.org/haskellwiki/Tying_the_Knot
The canonically simple example of this is the Fibonacci sequence, defined directly as a recurrence relation. (Yes, yes, hold the efficiency complaints/algorithms discussion -- the point is the idiom.): fibs = 1:1:zipwith (+) fibs (tail fibs)
Here's another story. I had some code that only worked with finite streams -- it did some things to create them out to a point, then did a whole bunch of nonsense that involved acting on various bits of the stream dependent on the entire stream prior to that point, merging it with information from another stream, etc. It was pretty nice, but I realized it had a whole bunch of cruft necessary for dealing with boundary conditions, and basically what to do when one stream ran out of stuff. I then realized that conceptually, there was no reason it couldn't work on infinite streams. So I switched to a data type without a nil -- i.e. a genuine stream as opposed to a list, and all the cruft went away. Even though I know I'll never need the data past a certain point, being able to rely on it being there allowed me to safely remove lots of silly logic, and let the mathematical/algorithmic part of my code stand out more clearly.

One of my pragmatic favorites is cycle. cycle [False, True] generates the infinite list [False, True, False, True, False ...]. In particular, xs ! 0 = False, xs ! 1 = True, so this is just says whether or not the index of the element is odd or not. Where does this show up? Lot's of places, but here's one that any web developer ought to be familiar with: making tables that alternate shading from row to row.
The general pattern seen here is that if we want to do some operation on a finite list, rather than having to construct a specific finite list that will “do the thing we want,” we can use an infinite list that will work for all sizes of lists. camcann’s answer is in this vein.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js