How to get absolute values instead of percentages in Perf’s annotate view? - profiling

Question
In Perf’s annotate view, the runtime consumed by each instruction is given on the left side as a percentage. Is there an option to display an absolute quantity (probably samples) instead?
Background
I am using some C code with classical C arrays as well as with NumPy arrays (for use in a Python module), and I want to compare their performance by running them on identical example cases. There are certain parts (e.g., initialisation) whose performance I know to differ and which I am not interested in. However, these affect the total runtime and thus make the percentage values for the other parts incomparable, unless I transform the values myself. If I could access the total runtimes, I could easily compare the different variants piece by piece.

Related

Word2Vec Wordvectors Most similar

I trained a Word2Vec model and I am trying to formulate the most_similar function mathematically.
I thought about a set that contains the n most similar words, given a word as reference.
Is there a good definition somewhere?
You can view the source code which implements most_similar() for the gensim Python library's KeyedVectors abstraction (for holding & performing common actions on sets of word-vectors):
https://github.com/RaRe-Technologies/gensim/blob/fbc7d0952f1461fb5de3f6423318ae33d87524e3/gensim/models/keyedvectors.py#L491
Roughly, it first computes a target vector – by combining any positive & negative examples the caller has provided. In the common case, this might just be a single ('positive') word-vector.
Then it calculates the cosine similarity with every other vector, sorts those similarities from highest to lowest, and returns the top-N results.
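Roughly in code, the computation looks like the following sketch (a simplified NumPy re-implementation for illustration, not the actual gensim source; `vectors` is assumed to be the model's unit-normalised word-vector matrix and `words` the aligned vocabulary list):

import numpy as np

def most_similar_sketch(vectors, words, positive, negative=(), topn=10):
    # vectors: (V, D) matrix of word-vectors, assumed L2-normalised row-wise
    # words:   list of V vocabulary words, aligned with the rows of `vectors`
    index = {w: i for i, w in enumerate(words)}

    # 1) build the target vector: mean of the positive vectors and the
    #    negated negative vectors, then re-normalise it
    target = np.mean([vectors[index[w]] for w in positive] +
                     [-vectors[index[w]] for w in negative], axis=0)
    target /= np.linalg.norm(target)

    # 2) cosine similarity with every vector (a dot product, since rows are unit length)
    sims = vectors @ target

    # 3) sort descending and return the top-N, skipping the query words themselves
    skip = {index[w] for w in list(positive) + list(negative)}
    order = np.argsort(-sims)
    return [(words[i], float(sims[i])) for i in order if i not in skip][:topn]

# hypothetical usage, with vecs/words_list taken from your trained model:
# print(most_similar_sketch(vecs, words_list, positive=['king']))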

Is there an adaptable algorithm (or code) for multiple-parameter optimization in Python?

I'm making a multiple-parameter optimization tool for a simulator in Python. I have 7 parameters, and when the parameters change, the 5 result items change. (Each parameter has different bounds.)
I don't know the simulator's equation, so I think I have to initialise random values and iterate an optimization algorithm until I find the parameter values that bring the 5 items close to the objective values. Could you advise me on a suitable algorithm? A sample code would make it easier for me to understand.
Thanks in advance.
I tried a GA (genetic algorithm), but it took too much time and couldn't find suitable values. I think that is because the bounds are too large and there are many parameters to change.
There are many libraries in Python dedicated to numerical optimization. I would recommend scipy.optimize for simpler tasks like the one you are describing, and pyomo for more complex optimization problems.
Problem type
Let's look at scipy.optimize. First, you need to know whether your optimization problem is convex or non-convex. Convex basically means there is only a single local minimum, which is the one we want to find. Non-convex problems can have multiple local minima in which the optimization algorithms can get stuck.
Convex problems
For convex problems, we can simply use scipy.optimize.minimize. It requires the function f(x) that we want to minimize, as well as an initial value x0 and (if available) the variable bounds.
A simple example:
from math import inf
import numpy as np
from scipy.optimize import minimize

def func(x):
    simulation_result = sim(x)  # call your simulator here
    objective_vector = np.array([1, 2, 3, 4, 5])  # replace this with your objective target vector
    return np.linalg.norm(simulation_result - objective_vector)

# one (lower, upper) pair per parameter (7 pairs for 7 parameters)
res = minimize(func, x0=np.ones(7),
               bounds=[(1, 2), (10, 20), (0, 1), (0, 1), (0, inf), (-inf, 0), (0, inf)])
if res.success:
    print(res.x)
Non-convex problems
This problem class is a lot harder and requires much more advanced algorithms. Luckily scipy.optimize also provides algorithms for this!
Check out my answer here and the documentation.
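For example, scipy.optimize.differential_evolution only needs the objective and finite bounds to search globally. A minimal sketch, assuming the same black-box set-up as above (the dummy sim, the bounds and the target vector are placeholders just to make it runnable):

import numpy as np
from scipy.optimize import differential_evolution

def sim(x):
    # dummy stand-in for the black-box simulator: maps 7 parameters to 5 result items
    return np.array([x[0] * x[1], x[2] + x[3], x[4], x[5], x[6]])

def func(x):
    simulation_result = sim(x)                    # call your simulator here
    objective_vector = np.array([1, 2, 3, 4, 5])  # replace with your 5 target values
    return np.linalg.norm(simulation_result - objective_vector)

# one (lower, upper) pair per parameter; differential evolution needs finite bounds
bounds = [(1, 2), (10, 20), (0, 1), (0, 1), (0, 100), (-100, 0), (0, 10)]

res = differential_evolution(func, bounds, seed=0, maxiter=200)
print(res.x, res.fun)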

C++ support vector machine (SVM) template libraries?

I have a dataset of custom abstract objects and a custom distance function. Are there any good SVM libraries that allow me to train on my custom objects (not 2D points) with my custom distance function?
I searched the answers in this similar stackoverflow question, but none of them allows me to use custom objects and distance functions.
First things first.
An SVM does not work on distance functions; it only accepts dot products. So your distance function (actually a similarity, but usually 1 - distance is used as a similarity) has to:
be symmetric: s(a,b) = s(b,a)
be positive definite: s(a,a) >= 0, and s(a,a) = 0 <=> a = 0
be linear in the first argument: s(ka,b) = k s(a,b) and s(a+b,c) = s(a,c) + s(b,c)
This can be tricky to check, as what you are really asking is: "is there a function phi from my objects to some vector space such that s(phi(x), phi(y)) is a dot product?" This leads to the definition of the so-called kernel, K(x,y) = s(phi(x), phi(y)). If your objects are themselves elements of a vector space, then sometimes it is enough to put phi(x) = x, so that K = s, but this is not true in general.
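A practical, non-rigorous way to spot-check the symmetry and positive (semi-)definiteness on a sample of your objects is to build the pairwise similarity matrix and inspect it numerically. A sketch in Python, where my_similarity and the sample of objects are placeholders for yours:

import numpy as np

def looks_like_valid_kernel(objects, my_similarity, tol=1e-8):
    # pairwise similarity matrix on a sample of objects
    S = np.array([[my_similarity(a, b) for b in objects] for a in objects])

    symmetric = np.allclose(S, S.T, atol=tol)      # s(a, b) == s(b, a)
    eigvals = np.linalg.eigvalsh((S + S.T) / 2)    # eigenvalues of the symmetrised matrix
    psd = eigvals.min() >= -tol                    # no significantly negative eigenvalues

    return symmetric and psd

# e.g. looks_like_valid_kernel(sample_of_my_objects, my_similarity)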
Once you have this kind of similarity, nearly any SVM library (for example libSVM) can work with it if you provide the Gram matrix, which is simply defined as
G_ij = K(x_i, x_j)
This requires O(N^2) memory and time. Consequently, it does not matter what your objects are, as the SVM only ever works on pairwise dot products, nothing more.
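To make the precomputed-Gram route concrete, here is a sketch in Python (the question asks about C++, but libSVM's precomputed-kernel mode works the same way; my_kernel, the toy arrays and the labels are all placeholders):

import numpy as np
from sklearn.svm import SVC

np.random.seed(0)

def my_kernel(a, b):
    # placeholder for a valid kernel K(x, y) on your custom objects
    return float(np.dot(a, b))

# toy stand-ins for the abstract objects and their labels
X_train = np.random.rand(20, 3)
y_train = np.random.randint(0, 2, 20)
X_test = np.random.rand(5, 3)

# Gram matrix between training objects: G_ij = K(x_i, x_j)
G_train = np.array([[my_kernel(a, b) for b in X_train] for a in X_train])

clf = SVC(kernel='precomputed')
clf.fit(G_train, y_train)

# at prediction time the kernel between test and *training* objects is needed
G_test = np.array([[my_kernel(a, b) for b in X_train] for a in X_test])
print(clf.predict(G_test))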
If you lack the appropriate mathematical tools to show this property, what can be done is to look at kernel learning from similarity. These methods are able to create a valid kernel which behaves similarly to your similarity.
Check out the following:
MLPack: a lightweight library that provides lots of functionality.
DLib: a very popular toolkit that is used both in industry and academia.
Apart from these, you can also use Python packages, but import them from C++.

Query re. how to set up an SVM, which SVM variation … and how to define a metric

I’d like to learn how best to set up an SVM in OpenCV (or another C++ library) for my particular problem (or whether there is in fact a more appropriate algorithm).
My goal is to receive a weighting of how well an input set of labelled points on a 2D plane compares or fits with a set of ‘ideal’ sets of labelled 2D points.
I hope my illustrations make this clear. The first three boxes, labelled A through C, indicate different ideal placements of 3 points; in my illustrations the labelling is managed by colour:
The second graphic gives examples of possible inputs:
If I then pass, for instance, example input set 1 to the algorithm, it will compare that input set with each ideal set, as illustrated here:
I would suggest that most observers would agree that example input 1 is most similar to ideal set A, then B, then C.
My problem is to get not only this ordering out of an algorithm, but ideally also a weighting of how much the input is like A relative to B and C.
For the example given it might be something like:
A:60%, B:30%, C:10%
Example input 3 might yield something such as:
A:33%, B:32%, C:35% (i.e. different order, and a less 'determined' result)
My end goal is to interpolate between the ideal settings using these weights.
To get the ordering, I’m guessing the ‘cost’ involved in fitting the input to each set may simply have been compared anyway (?) … if so, could this cost be used to find the weighting? Or maybe it is non-linear and some kind of transformation needs to happen? (But still, obviously, relative comparisons were OK to determine the order.)
Am I on track?
Direct question>> is the OpenCV SVM appropriate? Or, more specifically:
A series of separate binary SVM classifiers, one for each ideal state, and then a final ordering somehow? (i.e. what is the metric?)
A version of an SVM such as multiclass, structured SVM and so on from another library? (...which I still find hard to grasp conceptually, as the examples seem so unrelated)
Also, another critical component I’m not fully grasping yet is how to define what determines a good fit between any example input set and an ideal set. I was thinking Euclidean distance, where I simply sum the distances? What about outliers? My vector calculus needs a brush-up, but maybe dot products could nose in there somewhere?
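For concreteness, the naive version of that idea would look something like this sketch (points matched purely by label, no outlier handling; all names and coordinates are made up):

import math

def fit_cost(input_points, ideal_points):
    # both arguments: dict mapping label (colour) -> (x, y)
    # cost = sum of Euclidean distances between same-labelled points
    total = 0.0
    for label, (ix, iy) in input_points.items():
        jx, jy = ideal_points[label]
        total += math.hypot(ix - jx, iy - jy)
    return total

# made-up coordinates for three ideal sets and one example input
ideal = {
    'A': {'red': (0, 0), 'green': (1, 0), 'blue': (0, 1)},
    'B': {'red': (1, 1), 'green': (2, 1), 'blue': (1, 2)},
    'C': {'red': (3, 3), 'green': (4, 3), 'blue': (3, 4)},
}
example_input = {'red': (0.1, 0.1), 'green': (1.1, 0.2), 'blue': (0.2, 1.1)}

costs = {name: fit_cost(example_input, points) for name, points in ideal.items()}
print(costs)  # lower cost = better fit; turning costs into A/B/C weights is the open question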
Direct question>> How best to define a metric that describes a fit in this case?
The real case would have 10-20 points per set and, time permitting, as many 'ideal' sets of points as possible; let's go with 30 for now. Could I expect to get away with ~2 ms per iteration on a reasonable machine (a MacBook Pro), or does this kind of thing blow up?
(Disclaimer: I have asked this question more generally on Cross Validated, but there isn't much activity there.)

How to use shuffle in KFold in scikit-learn

I am running 10-fold CV using the KFold function provided by scikit-learn in order to select some kernel parameters. I am implementing this grid-search procedure (a rough sketch follows the list):
1 - pick a selection of parameters
2 - generate an SVM
3 - generate a KFold
4 - get the data that corresponds to training/cv_test
5 - train the model (clf.fit)
6 - classify with the cv_test data
7 - calculate the cv error
8 - repeat 1-7
9 - when ready, pick the parameters that provide the lowest average cv error
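Roughly, the loop looks like this sketch (using the current sklearn.model_selection API; the toy data and the candidate parameter values are placeholders for my real ones):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# toy data and candidate kernel parameters -- placeholders for the real ones
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
param_grid = [{'C': C, 'gamma': g} for C in (0.1, 1, 10) for g in (0.01, 0.1)]

best_params, best_error = None, np.inf
for params in param_grid:                              # 1 - pick a selection of parameters
    clf = SVC(**params)                                # 2 - generate an SVM
    kf = KFold(n_splits=10)                            # 3 - generate a KFold
    errors = []
    for train_idx, test_idx in kf.split(X):            # 4 - training / cv_test split
        clf.fit(X[train_idx], y[train_idx])            # 5 - train the model
        pred = clf.predict(X[test_idx])                # 6 - classify the cv_test data
        errors.append(np.mean(pred != y[test_idx]))    # 7 - cv error for this fold
    avg_error = np.mean(errors)                        # 8 - average over all folds
    if avg_error < best_error:                         # 9 - keep the lowest average cv error
        best_params, best_error = params, avg_error

print(best_params, best_error)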
If I do not use shuffle in the KFold generation, I get very much the same results for the average cv_error when I repeat the same runs, and the "best results" are repeatable.
If I use shuffle, I get different values for the average cv_error when I repeat the same run several times, and the "best values" are not repeatable.
I can understand that I should get different cv_errors for each KFold pass, but the final average should be the same.
How does KFold with shuffle really work?
Each time KFold is called, it shuffles my indices and generates training/test data. How does it pick the different folds for "training/testing"? Does it pick the folds in a random way?
Are there any situations where using "shuffle" is advantageous, and situations where it is not?
If shuffle is True, the whole data is first shuffled and then split into the K folds. For repeatable behaviour, you can set random_state, for example to an integer seed (random_state=0).
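For example (current sklearn.model_selection API; the data here is just a stand-in):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # stand-in data: 10 samples, 2 features

# shuffle=True with a fixed random_state: the folds differ from the unshuffled split,
# but are identical on every run, so results become repeatable again
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    print(test_idx)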
If your selected parameters depend on the shuffling, this means your parameter selection is very unstable. Probably you have very little training data, or you use too few folds (like 2 or 3).
The shuffle is mainly useful if your data is somehow sorted by class, because then each fold might contain only samples from one class (sorted classes are particularly dangerous for stochastic gradient descent classifiers).
For other classifiers, it should make no difference. If the results are very unstable under shuffling, your parameter selection is likely to be uninformative (aka garbage).
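A small illustration of the sorted-classes problem (toy data; without shuffling, every test fold contains only one class):

import numpy as np
from sklearn.model_selection import KFold

# toy data sorted by class: first 5 samples are class 0, last 5 are class 1
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

for train_idx, test_idx in KFold(n_splits=2).split(X):
    print(np.unique(y[test_idx]))   # prints [0] then [1]: each test fold sees only one class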