I have a large data set with a BINARY user/item feature matrix.
I need to cluster both users and items. Is there any way to do both simultaneously in Mahout?
More importantly, if I use log-likelihood as a similarity measure, which clustering
algorithms will actually support such a distance metric for clustering the data?
No, clustering users and clustering items are separate processes, though in spirit it's exactly the same process applied in two different ways.
If you want more specific answers within Mahout you'd have to say more about what parts of the code you are using because there are several different parts that involve clustering.
There are some agglomerative clustering pieces in the project, which work with any similarity metric. The other implementations that I know of are definitely of the "k-means" variety, which assume a continuous vector space rather than vectors over {0,1}. You would probably need a k-medoids kind of algorithm, and that isn't in the project as far as I know.
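Mahout won't do this for you, but the agglomerative route is straightforward to sketch outside it. Below is a minimal Python sketch (not Mahout code): log-likelihood-ratio similarity computed as a G-test over the 2x2 co-occurrence table of two binary rows, turned into a distance, and fed to SciPy's hierarchical clustering. The toy matrix and the similarity-to-distance mapping are my assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def llr_similarity(a, b):
    """Log-likelihood ratio (G-statistic) of the 2x2 co-occurrence
    table of two binary vectors; higher means more similar."""
    k11 = int(np.sum((a == 1) & (b == 1)))
    k12 = int(np.sum((a == 1) & (b == 0)))
    k21 = int(np.sum((a == 0) & (b == 1)))
    k22 = int(np.sum((a == 0) & (b == 0)))
    table = np.array([[k11, k12], [k21, k22]]) + 1   # +1 smoothing avoids zero cells
    g, _, _, _ = chi2_contingency(table, correction=False, lambda_="log-likelihood")
    return g

rng = np.random.default_rng(0)
X = (rng.random((50, 200)) > 0.8).astype(int)   # toy binary user x item matrix

# Build a symmetric distance matrix from pairwise similarities.
n = len(X)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = 1.0 / (1.0 + llr_similarity(X[i], X[j]))

Z = linkage(squareform(dist), method="average")  # agglomerative clustering
user_labels = fcluster(Z, t=10, criterion="maxclust")
```

To cluster the items instead, run the same loop over X.T; as noted above, the two clusterings remain separate passes.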
Terminology:
Component: PC
loading_score[i,j]: the loading of the j-th feature in PC[i]
Question:
I know that questions about feature selection have been asked several times here on StackOverflow (SO) and on other tech pages, with different answers/discussions proposed. That is why I want to open a discussion of the different solutions rather than post it as a general question, since that has been done.
Different methods are proposed for feature selection using PCA. For instance, using the dot product between the original features and the components (here) to get their correlation; a discussion at SO here suggests that you can only talk about important features as loading scores within a component (and not use that importance in the input space); and another discussion at SO (which I cannot find at the moment) suggests that the importance of feature[j] would be sum(abs(loading_score[:,j])), i.e. the sum of the absolute values of loading_score[i,j] over all components i.
I personally would think that a way to get the importance of a feature would be an absolute sum in which each loading_score[i,j] is weighted by the explained variance of component i, i.e.
imp_feature[j] = sum_i(abs(loading_score[i,j]) * explained_variance[i])
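For concreteness, here is a minimal sketch of what I mean using scikit-learn, where components_[i, j] plays the role of loading_score[i, j] (the toy data and the choice of explained_variance_ratio_ are my assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 8)               # toy data: 100 samples, 8 features
pca = PCA(n_components=4).fit(X)

# imp_feature[j] = sum_i |loading_score[i, j]| * explained_variance[i]
imp_feature = np.abs(pca.components_).T @ pca.explained_variance_ratio_
ranking = np.argsort(imp_feature)[::-1]   # feature indices, most "important" first
```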
Well, there is no universal way to select features; it depends entirely on the dataset and on whatever insights are available about it. I will provide some examples which might be helpful.
Since you asked about PCA: it extracts orthogonal components one at a time, ordered by how much of the data's variance each explains. ICA (Independent Component Analysis), on the other hand, is able to extract multiple statistically independent components simultaneously. Look at this example:
In this example, we mix three independent signals and try to separate them out using ICA and PCA. In this case, ICA does a better job than PCA. In general, if you search for Blind Source Separation (BSS) you will find more information about this. Besides, in this example we know the number of independent components, so separation is easy; in general we do not know that number and have to guess based on some prior information about the dataset. You may also use LDA (Linear Discriminant Analysis) to reduce the number of features.
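For reference, the mixed-signal comparison described above can be reproduced along the lines of the scikit-learn FastICA example; the three toy source signals and the mixing matrix here are assumptions:

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                     # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))            # source 2: square wave
s3 = 2 * (t % 1) - 1                   # source 3: sawtooth
S = np.c_[s1, s2, s3] + 0.1 * np.random.normal(size=(2000, 3))

A = np.array([[1.0, 1.0, 1.0],         # arbitrary mixing matrix
              [0.5, 2.0, 1.0],
              [1.5, 1.0, 2.0]])
X = S @ A.T                            # observed mixed signals

S_ica = FastICA(n_components=3).fit_transform(X)  # recovers sources (up to scale/order)
S_pca = PCA(n_components=3).fit_transform(X)      # only orthogonal directions of variance
```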
Once you have extracted components using any of these techniques, you can visualize them by treating the extracted components as random variables, i.e., x, y, z.
For more information you may refer to the original source from which I took the two figures.
Coming back to your proposition,
imp_feature[j] = sum_i(abs(loading_score[i,j]) * explained_variance[i])
I would not recommend this way due to the following reasons:
When you take absolute values with abs(loading_score[i,j]), you may lose the positive or negative correlations of the considered features. explained_variance[i] may be used to judge how much a component matters, but multiplying the two does not make much sense.
Edit:
In PCA, each component has its explained variance. The explained variance is the ratio between an individual component's variance and the total variance (the sum of all individual component variances). Feature significance can then be measured by the magnitude of explained variance.
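As a tiny worked example of that ratio (the component variances here are made up):

```python
import numpy as np

component_variance = np.array([4.0, 2.0, 1.0, 1.0])
explained_variance = component_variance / component_variance.sum()
# -> [0.5, 0.25, 0.125, 0.125]; scikit-learn exposes the same quantity
#    as pca.explained_variance_ratio_
```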
All in all, what I want to say is that feature selection depends entirely on the dataset and on the significance of the features. PCA is just one technique. First understand the properties of the features and the dataset, then try to extract features. Hope this helps. If you can provide us with an exact example, we may be able to provide more insight.
I am trying to deduplicate a huge list of companies (40M+) using name similarities. I have 500K company name pairs labelled same/not-same (e.g. I.B.M. = International Business Machines). A logistic regression model built on the vector differences of name pairs has a great F-score (0.98), but inference (finding the most similar names) is too slow (almost 2 seconds per name).
Is it possible to train a doc2vec model using the name-similarity pairs (positive and negative), so that similar names get similar vectors and I can use fast vector-similarity libraries like Annoy?
Searching for the top-N nearest-neighbors in high-dimensional spaces is hard. To get a perfectly accurate top-N typically requires an exhaustive search, which is probably the reason for your disappointing performance.
When some indexing can be applied, as with the ANNOY library, some extra indexing time and index-storage is required, and accuracy is sacrificed because some of the true top-N neighbors can be missed.
You haven't mentioned how your existing vectors are created. You don't need to adopt a new vector-creation method (like doc2vec) to use indexing; you can apply indexing libraries to your existing vectors.
If your existing vectors are sparse (as for example if they are big bag-of-character-n-grams representations, with many dimensions but most 0.0), you might want to look into Facebook's PySparNN library.
If they're dense, then in addition to the ANNOY you mentioned, Facebook's FAISS can be considered.
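For example, a minimal Annoy sketch, assuming you already have dense vectors for the names (the random `vectors` array is just a stand-in for your own embeddings):

```python
import numpy as np
from annoy import AnnoyIndex

dim = 128
vectors = np.random.rand(100_000, dim).astype("float32")  # your 40M name vectors here

index = AnnoyIndex(dim, "angular")        # angular distance ~ cosine similarity
for i, v in enumerate(vectors):
    index.add_item(i, v)
index.build(50)                            # more trees -> better recall, bigger index
index.save("names.ann")

neighbors = index.get_nns_by_item(0, 10)   # approximate top-10 neighbors of item 0
```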
But also, even an exhaustive search for neighbors is highly parallelizable: split the data into M shards on M different systems; finding the top-N on each is often close to 1/Mth the time of the same operation on the full index, and merging the M per-shard top-N lists is relatively quick. So if finding the most-similar is your key bottleneck and you need the top-N most-similar in, say, 100ms, throw 20 machines at 20 shards of the problem.
(Similarly, the top-N results for all items may be worth batch-calculating. If you're using cloud resources, rent 500 machines to do the 40 million 2-second operations and you'll be done in under two days.)
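The sharding idea is simple enough to sketch in a few lines; everything here (shard layout, L2-normalized vectors, cosine scoring) is an illustrative assumption, not a real API:

```python
import numpy as np

def topn_in_shard(shard, query, n):
    """Exact top-n within one shard (rows assumed L2-normalized,
    so the dot product is cosine similarity)."""
    sims = shard @ query
    idx = np.argpartition(-sims, n)[:n]            # n best, unordered
    return [(float(sims[i]), int(i)) for i in idx]

def topn_sharded(shards, query, n):
    # In production each shard would run on its own machine; here we loop.
    candidates = []
    for offset, shard in shards:                   # (global_offset, matrix) pairs
        candidates += [(s, offset + i) for s, i in topn_in_shard(shard, query, n)]
    return sorted(candidates, reverse=True)[:n]    # merge the per-shard top-n lists
```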
Currently I'm implementing a CBIR system for object recognition (object classification, in detail), and now that I have some working feature detectors and descriptors, I'm trying to find the best way to handle these features for the task of content-based image retrieval.
As far as I know, there are two main trends for this task: the discrete and the continuous approach. Discrete stands for methods like bag-of-visual-words and codebooks, used to build inverted indices so that methods from text retrieval can be applied; continuous stands for methods like Best Bin First search with k-d trees and nearest-neighbor classification.
So one main difference between the two approaches is that one works with an extra representation of the features (visual words), while the other works directly with the n-D features computed by the descriptor.
My question now is: is there any comparison of the two methods for CBIR that could help me find the best approach for my task?
The full answer to this question would be quite complex and long.
But generally, a continuous method can give you more accurate results; however, it's slower, as you can't effectively build a search index and you need to work with large descriptors.
You should consider a combination that uses discrete features (visual words) for initial results and afterwards filters/re-ranks the result set using continuous methods, as in the sketch below.
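Here's a rough sketch of that two-stage idea with scikit-learn k-means as the codebook; all the names and the simple L1/L2 scoring are illustrative assumptions, not a standard CBIR implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def bovw_histogram(descriptors, codebook):
    """Quantize local descriptors against a visual-word codebook."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-9)

# Codebook: k-means over a sample of descriptors from the whole collection.
all_descs = np.random.rand(5000, 64)          # stand-in for SIFT/SURF descriptors
codebook = KMeans(n_clusters=256, n_init=3).fit(all_descs)

def retrieve(query_descs, db_hists, db_descs, top_k=100, final_k=10):
    q_hist = bovw_histogram(query_descs, codebook)
    # Stage 1 (discrete): cheap histogram comparison over the whole DB.
    coarse = np.argsort([np.abs(q_hist - h).sum() for h in db_hists])[:top_k]
    # Stage 2 (continuous): expensive descriptor matching on candidates only.
    def match_cost(i):
        d = np.linalg.norm(query_descs[:, None, :] - db_descs[i][None, :, :], axis=2)
        return d.min(axis=1).mean()           # mean nearest-descriptor distance
    return sorted(coarse, key=match_cost)[:final_k]
```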
I want to develop an Intrusion Detection System (IDS) that might be used with one of the KDD datasets. In the present case, my dataset has 42 attributes and more than 4,000,000 rows of data.
I am trying to build my IDS using fuzzy association rules, hence my question: what is actually considered the best tool for fuzzy logic in this context?
Fuzzy association rule algorithms are often extensions of normal association rule algorithms like Apriori and FP-growth that model uncertainty using probability ranges. I thus assume that your data consists of quite uncertain measurements and that you therefore want to group the measurements into more general ranges such as 'low'/'medium'/'high'. From there on you can use any normal association rule algorithm to find the rules for your IDS (I'd suggest FP-growth, as it has lower complexity than Apriori for large data sets); a sketch follows below.
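Concretely, using pandas and mlxtend's fpgrowth; the file path, the crude equal-width binning into three ranges, and the thresholds are all illustrative assumptions:

```python
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

df = pd.read_csv("kddcup.csv")                  # hypothetical path to your KDD data

# Discretize each numeric attribute into low/medium/high ranges.
binned = df.select_dtypes("number").apply(
    lambda col: pd.cut(col, 3, labels=["low", "medium", "high"])
)

# One-hot encode so each (attribute, range) pair becomes a boolean item.
items = pd.get_dummies(binned)

freq = fpgrowth(items, min_support=0.05, use_colnames=True)
rules = association_rules(freq, metric="confidence", min_threshold=0.8)
```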
Anybody out there using BGL for large production servers?
How many nodes does your network consist of?
How do you handle community detection?
Does BGL have any cool ways to detect communities?
Sometimes two communities might be linked together by one or two edges, but these edges are not reliable and can fade away. Sometimes there are no edges at all.
Could someone speak briefly on how to solve this problem?
Please open my mind and inspire me.
So far I have managed to work out whether two nodes are on an island (in a community)
in an inexpensive manner, but now I need to work out which two nodes on separate islands are closest to each other. We can make only minimal use of the unreliable geographical data.
To put it figuratively, outside the social-distance context: given a mainland and an island, I want to work out which two bits of land are closest together across the body of water.
I've used the BGL for graphs with millions of nodes, but the size of graph you can use depends on which algorithm you are trying to run. You can quickly compute distances between nodes: there are four shortest-path algorithms, each most applicable depending on your data (single pairs of points, all pairs of points, sparse graphs, dense graphs, ...).
As for community detection, there aren't any algorithms built into the BGL specifically for that (but maybe you can contribute one when you are finished with your project). There are a few algorithms that might be helpful in building a community detection algorithm. The max-flow/min-cut algorithms are typically used in community detection: if a lot of flow is possible between two nodes, they are likely to be in the same community; if there isn't much flow, the min-cut is likely to represent the links between communities. There are also heuristics for ordering the nodes of a graph to reduce its bandwidth; nodes making up "communities" are likely to end up close to each other in such an ordering.
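This isn't BGL, but the min-cut intuition is easy to sketch with networkx in Python (the toy graph is an assumption):

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),   # community 1
                  ("x", "y"), ("y", "z"), ("x", "z"),   # community 2
                  ("c", "x")])                          # one weak bridge

cut_within = nx.minimum_edge_cut(G, "a", "c")  # many disjoint paths -> larger cut
cut_across = nx.minimum_edge_cut(G, "a", "z")  # bottleneck -> cut of size 1
print(len(cut_within), len(cut_across))        # 2 vs 1: "a","c" share a community
```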
As far as I know BGL doesn't have any algorithms specifically for community detection.
By "island" do you mean a disconnected subgraph?
Also, graphs do not have any notion of 'distance'.
This 'social distance' is something that you are going to have to define. Once you've done that, a large part of the work is done.
There are numerous methods listed on the page you linked to; most of them only require you to define something like a 'distance' metric, after which you can plug your definition into the algorithm.
@David Nehme
Graphs without edge weights capture only connectedness; they have no notion of distance. If you want to talk about a network, then you can talk about distance, but a graph with no edge weights has no distances, unless you assume an implied edge weight of 1 for all edges. That, however, really just turns the graph into a network.
Also, he is talking about the distance between two disconnected subgraphs. To model this, you have to introduce a concept of distance between nodes that is external to the graph, separate from edge distance.
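Both points are easy to see in a quick networkx sketch (the toy graph is an assumption):

```python
import networkx as nx

G = nx.Graph([(0, 1), (1, 2), (5, 6)])     # two disconnected "islands"
print(nx.shortest_path_length(G, 0, 2))     # 2 hops: distance under implied weight 1
# nx.shortest_path_length(G, 0, 6) would raise NetworkXNoPath: the graph gives
# no notion of distance between islands without external coordinates.
```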