Mahout 0.9 K-Means mapReduce analysis of the algorithm - mapreduce

I have been checking the algorithm of Mahout 0.9 k-means using MapReduce and I would like to know where can I check the code of what is happening inside the map function and in the reducer?
I was using debugging using NetBeans and I was not able to find what is exactly implemented in the Map and Reduce functions...
The reason what I am doing this is because I would like to know what is exactly implemented in the version of Mahout 0.9 in order to see which parts where optimized on the K-Means mapReduce algorithm.
If somebody knows which research paper the Mahout K-means were based on, that would also helped me a lot.
Thank you so much!
Best regards!

Download source code for mahout-core. Search for java file org.apache.mahout.clustering.kmeans.KMeansDriver.
In this java file search for line ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);
iterateMR function in class org.apache.mahout.clustering.iterator.ClusterIterator is the class which defines all configuration required for Map Reduce.
org.apache.mahout.clustering.iterator.CIMapper and org.apache.mahout.clustering.iterator.CIReducer are the Map reduce classes you are looking for.
Hope this helps!! :)
However, I do not know which research paper is implemented.

K-means (more precisely, Lloyds algorithm) is naively parallel. I doubt there is a paper discussing the implementation used by Mahout, because it's the obvious way to do so. There is absolutely no trick involved:
Lloyds algorithm consists mostly of a sum, and sums are trivially to parallelize.
Unfortunately (like much of Hadoop), Mahout is 10 layers thick abstraction. Which doesn't yield the best performance, but in particular makes it also really hard to dig through all the code and meta-code to the actual implementation. See the other answere here for pointers to the source code fragments scattered in a dozen classes.
When playing around with Mahout, make sure to also include non-Hadoop implementations of k-means in your experiments. You will be surprised how often they A) outperform Mahout, and B) provide better results.

Related

equivalent of wavedec (matlab function) in opencv

I am trying to rewrite a matlab code to cpp and I still blocked with this line :
[c, l]=wavedec(S,4,'Dmey');
Is there something like that in opencv ?
if someone have an idea about it try to share it with and thanks in advance.
Maybe if you would integrate your codes with Python, then PyWavelet might be an option.
I'm seeing the function that you're looking for is in there. That function is called Discrete Meyer (Dmey). Not sure what you're planning to do with that though, maybe you're processing some images or something, but Dmey is OK, not very widely used. You might want to just find some GitHub codes and integrate to whatever you're doing to see if it would work first, and based on those you can also change the details of your currently posted function (might find something more efficient).
1-D wavelet decomposition (wavedec)
In your code, c and l stand for coefficients and level. You're passing level four with a Dmey function. If you'd have one dimensional data, the following map is how your decomposition would look like, roughly I guess:
There are usually two types of decomposition models that are being used in Wavelets, one is called packet which is similar to a Full Binary Tree, from architecture standpoint:
The other one, which is the one you're most likely using, is less computationally expensive, because it does not decompose both branches of a tree, if you will. It'd just do the mathematical decomposition in one branch of the tree. Maybe, these images would shed some lights:
1 D
2 D
Notes:
If you have a working model in MatLab, you might want to see the C/C++ Code Generation in MatLab, will automatically convert MatLab codes to C++.
References:
Images are from Wikipedia or mathworks.com
Wiki
mathworks
Wavelet 2D

How to write tests for mathematical optimization procedures?

I'm working on project where I need to minimize functions by several variables like func(input_parameters, variable_parameters) -> min(variable_parameters).
I use optimizing functions from SciPy, so minimization process is a grey box: I can see the code on GitHub and read about used algorithms, but I'd like to think that it's okay and aim to testing of my own project.
Though, particular libraries shouldn't matter in this question.
At the moment I use few approaches:
Create simple examples and find global/local minima by hand and create test that performs optimization and compares its solution with the right one
If method needs gradients, compare analytically calculated gradients with their numerical approximation in tests
For iterative algorithms built upon ones provided by SciPy check that sequence of function values is monotonically nonincreasing in tests
Is there a book or an article about testing of mathematical optimization procedures?
P. S. I'm not talking about Test functions for optimization
, I'm asking about approaches used to test optimization procedure to find bugs faster.
I find the hypothesis library really useful for testing optimisation algorithms in development.
You can set it up to generate random test cases (functions, linear programs, etc) according to some specification. The idea is that you pass these to your algorithm and test for known invariants. For example you could have it throw random problems or subproblems at your algorithm and check that (for example):
Gradient descent methods produce a series of nonincreasing objectives
Local search finds a solution with no better neighbours
Heuristics maintain feasibility
There's a useful PyCon talk here explaining the idea of property based testing. It focuses more on testing APIs than algorithms, but I think the ideas transfer. I've found this approach does a pretty good job finding cases of unexpected behaviour as I'm writing a new algorithm.

Least Median of Squares robust regression C++

I have a set of data z(0), z(1), z(2)...,z(n) that I am currently fitting with a 2 variables polynomial of the kind p(x,y) = a(1)*x^2+a(2)*y^2+a(3)*x*y+a(4). I have i=1,...,n (x(i),y(i)) coordinates that I impose to be p(x(i),y(i))=z(i). In this way I have a Overdetermined System that I can solve using Eigen SVD . I am looking for a more sophisticated method that can take care of outliers, like a Least Median of Squares robust regression (as described here) but I haven't found a C++ implementation for 2 variables. I looked in GSL but it seems there is nothing for 2 variable functions. The only other solution I can think of is using a TGraph2D in ROOT. Do you know any other solution? Numerical recipes maybe? Since I am writing C++ code I would prefer C or C++ implementations.
Since non answer has been given yet, but I am still working on this problem, I will share my progresses here.
The class TLinearFitter has a fit method that allows you to select Robust fitting - Least Trimmed Squares regression (LTS):
https://root.cern.ch/root/html532/TLinearFitter.html
Another possible solution, more time consuming maybe, but maybe more efficient on the long run is to write my own function to be minimized, and the use:
https://projects.coin-or.org/Ipopt to minimize it. Although in this approach there is a bigger "step". I don't know how to use the library and I haven't (yet?) found a nice tutorial to understand it.
here: https://wis.kuleuven.be/stat/robust/software there is a Fortran implementation of the LMedS algorithm called PROGRESS. So another possible solution could be to port this software to C/C++ and make a library out of it.

How to implement 3d kDTree build and search algorithm in c or c++?

How to implement 3d kDTree build and search algorithm in c or c++? I am wondering if we have a working code to follow
I'd like to recommend you a two good presentations to start with:
"Introduction to k-d trees"
"Lecture 6: Kd-trees and range trees".
Both give all (basic ideas behind kd-trees, short visual examples and code snippets) you need to start writing your own implementation.
Update-19-10-2021: Resources are now longer public. Thanks to #Hari for posting new links down below in the comments.
I found the publications of Vlastimil Havran very useful. His Ph.d Thesis gives a nice introduction to kd-trees and traversal algorithms. Further articles are about several improvements, e.g. how to construct kd-tree in O(nlogn). There are also a lot of implementations in different graphic libs. You should just google for it.
For an example of a 3D kd-tree implementation in C, take a look at kd3. It is not general purpose library and requires the input data to be in a specific form, but the ideas and approach should be transferable.
Disclosure: I am the author of kd3.
Disclaimer: It was written as proof-of-concept code for an existing application and is therefore not as generic nor as well-tested as it should be. Bug reports/fixes are welcome.

HOT(Heap On Top) Queues

Could anyone point me to an example implementation of a HOT Queue or give some pointers on how to approach implementing one?
Here is a page I found that provides at least a clue toward what data structures you might use to implement this. Scroll down to the section called "Making A* Scalable." It's unfortunate that the academic papers on the subject mention having written C++ code but don't provide any.
Here is the link to the paper describing HOT queues. It's very abstract, that's why i wanted
to see a coded example (I'm still trying to fin my way around it).
http://www.star-lab.com/goldberg/pub/neci-tr-97-104.ps
The "cheapest", sort to speak variant of this is a two-level heap queue (maybe this sounds more familiar). What i wanted to do is to improve the running time of the Dijkstra's shortest path algorithm.
What i wanted to do is to improve the
running time of the Dijkstra's
shortest path algorithm.
Have you considered using the Boost Graph Library?
If you are using your own implementation of the algorithm you might already get better results using the one the BGL provides.
However It might be nontrivial to modify your code so it works with the BGL.
Of course speed-up could also be gained by not using Dijkstra at all but another algorithm.