A-close data mining implementation

I need to compare the Apriori and A-close algorithms on a dataset, so I need implementations of both. I can find implementations of the Apriori algorithm, but I can't find any implementation of A-close. It would save me a lot of time if I could find one. Does anyone have an implementation of this algorithm and want to share it, or have some tips for finding one?
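For context, my understanding is that the core of A-close is the Galois closure operator, closure(X) = the intersection of all transactions containing X, which A-close applies to the frequent generators it finds Apriori-style. Here is a minimal C++ sketch of just that closure step, to clarify what I'm looking for (the names are my own, not from any library):

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <set>
    #include <vector>

    using Itemset = std::set<int>;

    // Closure of X: intersection of all transactions that contain X.
    // This is the operator A-close applies to frequent generators.
    Itemset closure(const Itemset& x, const std::vector<Itemset>& db) {
        Itemset result;
        bool first = true;
        for (const Itemset& t : db) {
            if (!std::includes(t.begin(), t.end(), x.begin(), x.end()))
                continue;                              // t does not contain x
            if (first) { result = t; first = false; }
            else {
                Itemset tmp;
                std::set_intersection(result.begin(), result.end(),
                                      t.begin(), t.end(),
                                      std::inserter(tmp, tmp.begin()));
                result = tmp;
            }
        }
        return first ? x : result;                     // x occurs nowhere: return x itself
    }

    int main() {
        std::vector<Itemset> db = {{1, 2, 3}, {1, 2}, {1, 3, 4}};
        for (int i : closure({2}, db)) std::cout << i << ' ';  // prints: 1 2
    }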

Related

Mahout 0.9 K-Means mapReduce analysis of the algorithm

I have been examining the Mahout 0.9 k-means algorithm using MapReduce, and I would like to know where I can check the code for what happens inside the map function and the reducer.
I tried debugging with NetBeans, but I was not able to find what exactly is implemented in the Map and Reduce functions...
The reason I am doing this is that I would like to know exactly what is implemented in Mahout 0.9, in order to see which parts of the K-Means MapReduce algorithm were optimized.
If somebody knows which research paper the Mahout K-means implementation was based on, that would also help me a lot.
Thank you so much!
Best regards!
Download the source code for mahout-core. Search for the Java file org.apache.mahout.clustering.kmeans.KMeansDriver.
In this file, search for the line ClusterIterator.iterateMR(conf, input, priorClustersPath, output, maxIterations);
The iterateMR function in the class org.apache.mahout.clustering.iterator.ClusterIterator defines all the configuration required for MapReduce.
org.apache.mahout.clustering.iterator.CIMapper and org.apache.mahout.clustering.iterator.CIReducer are the Mapper and Reducer classes you are looking for.
Hope this helps!! :)
However, I do not know which research paper is implemented.
K-means (more precisely, Lloyd's algorithm) is trivially parallel. I doubt there is a paper discussing the implementation used by Mahout, because it's the obvious way to do it. There is absolutely no trick involved:
Lloyd's algorithm consists mostly of computing sums, and sums are trivial to parallelize.
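To illustrate: one Lloyd iteration is just "assign each point to its nearest centroid, then average the assigned points", and the per-cluster sums are exactly what a MapReduce job shards. A minimal serial sketch in C++ (hypothetical names, not Mahout's code):

    #include <cstddef>
    #include <vector>

    using Point = std::vector<double>;

    static double dist2(const Point& a, const Point& b) {
        double d = 0;
        for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
        return d;
    }

    // One Lloyd iteration: assignment step + update step.
    // In the MapReduce version, the map phase emits (nearest centroid, point)
    // pairs and the reduce phase computes the per-cluster sums below.
    void lloydStep(const std::vector<Point>& data, std::vector<Point>& centroids) {
        std::size_t k = centroids.size(), dim = centroids[0].size();
        std::vector<Point> sum(k, Point(dim, 0.0));
        std::vector<std::size_t> count(k, 0);
        for (const Point& p : data) {
            std::size_t best = 0;
            for (std::size_t c = 1; c < k; ++c)
                if (dist2(p, centroids[c]) < dist2(p, centroids[best])) best = c;
            for (std::size_t i = 0; i < dim; ++i) sum[best][i] += p[i];  // the "sum" part
            ++count[best];
        }
        for (std::size_t c = 0; c < k; ++c)
            if (count[c] > 0)
                for (std::size_t i = 0; i < dim; ++i)
                    centroids[c][i] = sum[c][i] / count[c];
    }

Because the sums are associative, partial sums computed on separate shards of the data can simply be added together in the reducers; that is the whole parallelization.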
Unfortunately (like much of Hadoop), Mahout is 10 layers of abstraction thick. That doesn't yield the best performance, but in particular it also makes it really hard to dig through all the code and meta-code to the actual implementation. See the other answer here for pointers to the source code fragments scattered across a dozen classes.
When playing around with Mahout, make sure to also include non-Hadoop implementations of k-means in your experiments. You will be surprised how often they A) outperform Mahout, and B) provide better results.

What algorithm is OpenCV's GCGRAPH (max flow) based on?

OpenCV has an implementation of a max-flow algorithm (the class GCGRAPH in the file gcgraph.hpp). It's available here.
Does anyone know which particular max-flow algorithm is implemented by this class?
I am not 100% confident about this, but I believe that the algorithm is based on this research paper describing max-flow algorithms for computer vision. Specifically, Section 3 describes a new algorithm for computing maximum flows.
I haven't lined up every detail of the paper's algorithm with the implementation, but many details seem to match:
The algorithm described works by using a bidirectional search from both s and t, which the implementation is doing as well: for example, there's a comment reading // grow S & T search trees, find an edge connecting them.
The algorithm described keeps track of a set of orphaned nodes, which the variable std::vector<Vtx*> orphans seems to track in the implementation.
The algorithm described works by building up a set of trees and reusing them; the algorithm implementation keeps track of a tree associated with each node.
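For contrast, here is a textbook augmenting-path max-flow (Edmonds-Karp) sketch; note how the BFS restarts from scratch after every augmentation, which is precisely the work the paper's algorithm avoids by keeping the S and T search trees alive between augmentations. This is illustrative only, not the OpenCV code:

    #include <algorithm>
    #include <climits>
    #include <queue>
    #include <vector>

    // Edmonds-Karp: repeatedly find a shortest augmenting path by BFS.
    // cap is an adjacency matrix of residual capacities.
    int maxFlow(std::vector<std::vector<int>> cap, int s, int t) {
        int n = (int)cap.size(), flow = 0;
        while (true) {
            std::vector<int> parent(n, -1);
            parent[s] = s;
            std::queue<int> q;
            q.push(s);
            while (!q.empty() && parent[t] == -1) {    // BFS restarts every round
                int u = q.front(); q.pop();
                for (int v = 0; v < n; ++v)
                    if (parent[v] == -1 && cap[u][v] > 0) { parent[v] = u; q.push(v); }
            }
            if (parent[t] == -1) return flow;          // no augmenting path left
            int bottleneck = INT_MAX;
            for (int v = t; v != s; v = parent[v])
                bottleneck = std::min(bottleneck, cap[parent[v]][v]);
            for (int v = t; v != s; v = parent[v]) {   // push flow along the path
                cap[parent[v]][v] -= bottleneck;
                cap[v][parent[v]] += bottleneck;
            }
            flow += bottleneck;
        }
    }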
I hope this helps!

K-nearest neighbour C/C++ implementation

Where can I find a serial C/C++ implementation of the k-nearest neighbour algorithm?
Do you know of any library that has this?
I have found OpenCV, but its implementation is already parallel.
I want to start from a serial implementation and parallelize it with pthreads, OpenMP, and MPI.
Thanks,
Alex
How about ANN? http://www.cs.umd.edu/~mount/ANN/. I once used its kd-tree implementation, but there are other options.
Quoting from the website: "ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions."
I wrote a C++ implementation for a KD-tree with nearest neighbor search. You can easily extend it for K-nearest neighbors by adding a priority queue.
Update: I added support for k-nearest neighbor search in N dimensions
The simplest way to implement this is to loop through all elements and keep the K nearest seen so far (just comparing distances); a minimal sketch of this is shown below. The complexity is O(n), which is not great, but no preprocessing is needed, so whether that is acceptable really depends on your application. To do better, you should use a spatial index to partition the area in which you search for the k nearest neighbours. For some applications a grid-based spatial structure is fine (just divide your world into fixed blocks and search only within the closest blocks first); this works well when your entities are evenly distributed. A better approach is to use a hierarchical structure like a kd-tree. It really all depends on what you need.
For more information, including pseudocode, look at these presentations:
http://www.ulozto.net/xCTidts/dpg06-pdf
http://www.ulozto.net/xoh6TSD/dpg07-pdf
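To make the brute-force approach above concrete, here is a minimal serial C++ sketch that keeps the K best candidates in a priority queue (a max-heap on distance); the names are placeholders. The outer loop has no dependencies between iterations, which is what makes it easy to parallelize later with pthreads, OpenMP, or MPI:

    #include <queue>
    #include <utility>
    #include <vector>

    struct Point { double x, y; };

    static double dist2(const Point& a, const Point& b) {
        double dx = a.x - b.x, dy = a.y - b.y;
        return dx * dx + dy * dy;
    }

    // Brute-force k-nearest neighbours: scan all points, keep the k best
    // in a max-heap keyed on distance (top = worst candidate so far).
    std::vector<Point> knn(const std::vector<Point>& pts, const Point& query, std::size_t k) {
        using Cand = std::pair<double, std::size_t>;   // (squared distance, index)
        std::priority_queue<Cand> heap;                // max-heap by distance
        for (std::size_t i = 0; i < pts.size(); ++i) {
            double d = dist2(pts[i], query);
            if (heap.size() < k) heap.push({d, i});
            else if (d < heap.top().first) { heap.pop(); heap.push({d, i}); }
        }
        std::vector<Point> result;
        while (!heap.empty()) { result.push_back(pts[heap.top().second]); heap.pop(); }
        return result;                                 // farthest neighbour first
    }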

Hash Table Implementation Using An Array of Linked Lists

This question has been bugging me for quite a long time, and today I read a detailed article about hash tables. Without checking any implementation examples, I wanted to take a shot at writing a hash table from scratch.
The separate chaining method gave me the idea for implementing the hash table. Anyone with experience in data structures might regard this question as trivial, but I'm a beginner, and before diving straight into the code I wanted to discuss my implementation's efficiency. Would it be efficient, or would other fundamental approaches be preferable?
I think for starters you could also peek into the source (or documentation) of the hash maps implemented in the Boost libraries. It is called unordered_map. (link is here)
If you don't know about these implementations, want to use a hash table, and are annoyed that it is not in the STL, you are tempted to write your own fast data store.
But implementing hash maps yourself is so much out of the game now that C++11 has unordered_map in its STL. You'll see there is plenty of more interesting stuff out there.
Note: separate chaining is also called bucket hashing. In fact, Boost uses bucket hashing; see this link. Maybe you could instead look up some performance comparisons; chances are that those who run the benchmarks write good enough implementations.
With closed addressing, another alternative is to use a self-balancing binary search tree, e.g. a red-black tree (std::map) or a heap tree, as the inner data structure, or even another hash map with a different hashing algorithm.
With open addressing, alternatives to linear probing are quadratic probing and double hashing; there are also less commonly used strategies such as cuckoo hashing, hopscotch hashing, etc.
The key points in implementing a hash table are choosing the right hashing algorithm, the resizing strategy (load factor), and the collision resolution strategy. The best strategy is highly dependent on the type of workload you're expecting, as there are tradeoffs for each approach.
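To tie the three key points together, here is a minimal separate-chaining sketch showing one choice for each: std::hash as the hash function, doubling at load factor 1.0 as the resizing strategy, and chaining in a linked list as the collision resolution. It is a teaching sketch, not a replacement for std::unordered_map:

    #include <functional>
    #include <list>
    #include <string>
    #include <utility>
    #include <vector>

    // Minimal separate-chaining hash table: an array (vector) of linked lists.
    class ChainedMap {
        using Bucket = std::list<std::pair<std::string, int>>;
        std::vector<Bucket> buckets_ = std::vector<Bucket>(8);
        std::size_t size_ = 0;

        std::size_t index(const std::string& key, std::size_t nBuckets) const {
            return std::hash<std::string>{}(key) % nBuckets;  // hash function choice
        }
        void maybeGrow() {                                    // resize at load factor 1.0
            if (size_ < buckets_.size()) return;
            std::vector<Bucket> bigger(buckets_.size() * 2);
            for (const Bucket& b : buckets_)
                for (const auto& kv : b)
                    bigger[index(kv.first, bigger.size())].push_back(kv);
            buckets_ = std::move(bigger);
        }
    public:
        void put(const std::string& key, int value) {
            maybeGrow();
            Bucket& b = buckets_[index(key, buckets_.size())];
            for (auto& kv : b)
                if (kv.first == key) { kv.second = value; return; }  // overwrite existing
            b.push_back({key, value});                               // collisions just chain
            ++size_;
        }
        const int* get(const std::string& key) const {
            for (const auto& kv : buckets_[index(key, buckets_.size())])
                if (kv.first == key) return &kv.second;
            return nullptr;                                          // key absent
        }
    };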

Common patterns in a database

I need to find common patterns in a database of sequences of events. So far, I have considered the longest common substring problem and its Python implementation while searching for a solution.
Note that I am not searching for the longest common substring only: I accept shorter common substrings appearing frequently in the database.
Can you suggest some algorithm, implementation tricks or general advice about this problem?
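To make it concrete, here is a brute-force C++ sketch of what I mean: count each substring (up to some maximum length) once per sequence and keep those above a support threshold. The names and thresholds are placeholders, and I assume a real solution needs something far smarter than this:

    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Brute force: count how many sequences contain each substring of
    // length 2..maxLen, then keep substrings above a support threshold.
    std::unordered_map<std::string, int>
    frequentSubstrings(const std::vector<std::string>& db, std::size_t maxLen, int minSupport) {
        std::unordered_map<std::string, int> count;
        for (const std::string& seq : db) {
            std::unordered_map<std::string, bool> seen;   // count once per sequence
            for (std::size_t i = 0; i < seq.size(); ++i)
                for (std::size_t len = 2; len <= maxLen && i + len <= seq.size(); ++len) {
                    std::string sub = seq.substr(i, len);
                    if (!seen[sub]) { seen[sub] = true; ++count[sub]; }
                }
        }
        std::unordered_map<std::string, int> frequent;
        for (const auto& [sub, c] : count)
            if (c >= minSupport) frequent[sub] = c;
        return frequent;
    }

    int main() {
        std::vector<std::string> db = {"abcab", "xabcy", "abz"};
        for (const auto& [sub, c] : frequentSubstrings(db, 3, 2))
            std::cout << sub << " appears in " << c << " sequences\n";
    }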
The previous answer suggested Apriori. But Apriori is inappropriate if you want to find frequent sequences, because Apriori does not take the ordering of events into account (it is also an inefficient algorithm).
If you want to find subsequences that are common to several sequences, it would be more appropriate to use a sequential pattern mining algorithm such as PrefixSpan or SPAM.
If you want to make some predictions, another option would also be to use a sequential rule mining algorithm.
I have open-source Java implementations of sequential pattern mining and sequential rule mining algorithms that you can download from my website: http://www.philippe-fournier-viger.com/spmf/
I don't think that you could process 8 GB of data in one shot with these algorithms, but they could be a starting point. Actually, some of these algorithms could be adapted to very large databases by implementing a disk-based strategy.
Have you considered Frequent Itemset Mining methods such as Apriori?