Understanding Partition density of partitioned network - data-mining

I am implementing the Link Communities community detection algorithm, and I have trouble understanding the explanation of partition density described in the paper.
Here is only the part defining partition density:
I cannot find the connection between the definition of link density (equation 2) and the definition of partition density (equation 3). Because of that, I don't understand why partition density is defined the way it is. And I especially don't see how equation (3) is calculating the average of equation (2): if it were an average, I would expect the number of partitions (C) to appear in the denominator (below the horizontal line).
I could not google any other definition of partition density.
Anyone who can shed some light on it would be appreciated.

It is not really an average, but rather a weighted sum: each community is weighted by the proportion of the total links it contains (i.e. $m_c / M$). So instead of having a uniform factor of $1/C$, where $C$ is the number of communities, you have a community-specific factor of $m_c / M$.

(2) is the density of one community, (3) is the weighted density sum of all communities, the weight is $m_c / M$.
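For reference, here is a minimal sketch in Python of that weighted sum, assuming the usual quantities from the Link Communities paper: $m_c$ is the number of links in community c, $n_c$ the number of nodes those links touch, and $M$ the total number of links (a community induced by a single link has density 0 by convention):

def partition_density(communities, M):
    """communities: iterable of (m_c, n_c) pairs; M: total number of links."""
    D = 0.0
    for m_c, n_c in communities:
        if n_c <= 2:
            continue  # a single-link community has density 0 by convention
        # equation (2): link density of community c
        D_c = (m_c - (n_c - 1)) / (n_c * (n_c - 1) / 2.0 - (n_c - 1))
        # equation (3): weight each community by its share of links, m_c / M
        D += (m_c / M) * D_c
    return D

# Example: two communities with (m_c, n_c) = (5, 4) and (3, 3) in a network of M = 8 links
print(partition_density([(5, 4), (3, 3)], M=8))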

Related

How to assess the variance explained by a principal component in a sub-set of data?

I have conducted a PCA (in Matlab) on a set of thousands of points of spatial data. I have also calculated the variance explained across the full dataset by each principal component (i.e. PC or eigenvector) by dividing its eigenvalue by the sum of all eigenvalues. As an example, PC 15 accounts for 2% of the variance in the entire dataset; however, there is a subset of points in this dataset for which I suspect PC 15 accounts for a much higher % of their variance (e.g. 80%).
My question is this: is there a way to calculate the variance explained by a given PC, from my existing analysis, for only a subset of points (i.e. 1000 pts from the full dataset of 500k+)? I know that I could run another PCA on just the subset, but for my purposes I need to continue to use the PCs from my original analysis. Any idea for how to do this would be very helpful.
Thanks!
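One way to get this from the existing analysis: project the subset onto the original loadings and compare the variance of its scores on the PC of interest with the subset's total variance (the original PCs form an orthonormal basis, so the per-PC score variances sum to the subset's total variance). A sketch in Python with illustrative names; coeff stands for the loading matrix returned by the original PCA:

import numpy as np

def subset_variance_explained(X_subset, coeff, pc_index):
    """Fraction of the subset's variance captured by one of the ORIGINAL PCs.

    X_subset : (m, d) subset rows in the original feature space
    coeff    : (d, p) loading matrix from the original PCA (columns = PCs)
    pc_index : 0-based index of the PC of interest (PC 15 -> 14)
    """
    Xc = X_subset - X_subset.mean(axis=0)      # centre on the subset's own mean
    scores = Xc @ coeff[:, pc_index]           # subset scores along that PC
    return np.var(scores) / np.sum(np.var(Xc, axis=0))

Whether to centre on the subset's own mean or on the full dataset's mean is a modelling choice; the version above treats the subset as its own population.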

Estimate the probability density of a given value if it belongs to a highly peaked multivariate dataset with high kurtosis (>100)

I have a dataset with multiple variables, each of them heavily centered around zero so that they form a high peak. The kurtosis of each variable is more than 100.
What I want to estimate is the probability density of any given value, assuming it belongs to the dataset. The most accessible distribution function I have found so far is the multivariate Gaussian distribution. However, since my dataset is not normally shaped, I am worried that estimating the probability density with this function would be inaccurate.
Does anyone have any good suggestions on which function to use for this purpose?
You are repeating a common incorrect interpretation of kurtosis, namely "peakedness," which contributes to the confusion about what distribution to use.
Kurtosis does not measure "peakedness" at all. You can have a distribution with a perfectly flat peak, with a V-shaped peak, with a trimodal peak, with a wavy peak, or with any shape of peak whatsoever, that has infinite kurtosis. And you can have a distribution with an infinite peak that has negative (excess) kurtosis.
Instead, kurtosis is a measure of the tails (outlier potential) of the distribution, not the peak. The only reason people think that there is a "high peak" when there is high kurtosis is that the outliers stretch the horizontal scale of the histogram, making the data appear concentrated in a narrow vertical strip. But if you zoom in on the bulk of the data in that strip, the peak can have any shape whatsoever. Further, if you compare the height of your histogram of standardized data with the height of a corresponding standard normal, either can be higher, no matter what your data show. The "height" mythology was debunked around 1945 by Kaplansky.
For your data, you do not need a "peaked" distribution. Instead, you need a distribution that allows such extreme values as you have observed. Examples include mixture distributions, lognormal distributions, t distributions with small degrees of freedom, or multivariate versions of such, if that's what you need.
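A quick numerical illustration of the point that kurtosis is driven by the tails rather than the peak (a sketch using scipy; the particular mixture is just an example):

import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)

# Flat-topped bulk (uniform) plus a handful of extreme outliers:
# the peak is perfectly flat, yet the excess kurtosis is enormous.
flat_bulk_heavy_tails = np.concatenate([rng.uniform(-1, 1, 100_000),
                                        rng.normal(0, 50, 100)])

# Sharply peaked Laplace data with modest tails: excess kurtosis is only about 3.
peaked_light_tails = rng.laplace(0, 1, 100_000)

print(kurtosis(flat_bulk_heavy_tails))  # huge, because of the 100 outliers
print(kurtosis(peaked_light_tails))     # ~3, despite the sharp peak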
References:
Westfall, P.H. (2014). Kurtosis as Peakedness, 1905 – 2014. R.I.P. The American Statistician, 68, 191–195.
(A simplified discussion of the above paper is given in the talk section of the Wikipedia entry on kurtosis.)

How to calculate efficiently and accurately the Fourier transform of a radial function in Fortran

As my question states, I want to calculate the Fourier transform F(q) of a radial function f(r) (defined on $[0,\infty)$ and decaying like an exponential $\exp(-Ar+b)$ at large r) as accurately as possible in Fortran. The function values come from a data file (which I can easily interpolate, through cubic interpolation for example, and extrapolate, since the behaviour at large r is known).
I'm using the "physics" definition of the Fourier transform in 3D, which gives (because f is radial): $F(q) = \frac{4\pi}{q} \int_0^{\infty} r\, f(r)\, \sin(qr)\, dr$
I first tried to calculate this integral for some chosen values of q by using Gauss-Legendre quadrature, generating some 60 or 100 abscissas and weights via the NAG routine D01BCF (D01BCF link). In the case of Gauss-Legendre quadrature, the problem is to choose the interval [0,B] on which to integrate. While the function f loses 4 to 5 orders of magnitude from r=10 to r=20 (for example), the choice of B has a strong influence on the result of the calculation... When I compared my result to a "nearly exact" calculation (made with Matlab, but with a very long computation time), I saw that it was in fact only valid for small values of q (of the order of 5, when I have to deal with values as large as 150). A Gauss-Laguerre quadrature does not give any better result, probably because of the oscillatory part of the integrand.
I then tried to compute this Fourier transform for some given values of q with the routine D01ASF (D01ASF link). It is a "one-dimensional quadrature, adaptive, semi-infinite interval, weight function cos(ωx) or sin(ωx)", which is exactly what I need. The results are quite convincing for q up to 80 or 100 if I input absolute error tolerances of 10E-5. The problems are: I would need to go to larger q, and the Fourier transform F(q) oscillates with a magnitude of ~10E-6 at such q's. Lowering the tolerance to 10E-5 already takes some time and even makes the routine output error messages, so I don't know if 10E-6 would be feasible.
I'm thus currently wondering whether trying to calculate this Fourier transform with an FFT would be a good idea. The problems I face are that I don't know how to handle a radial function with an FFT, and that I have never used an FFT before (I don't even know how to use it properly, since the definition of the transform is not the same: the sign and argument of the exponent differ).
Would you have ideas ? :)
EDIT 2: I tried the FFT approach (using the routine C06FAF from the NAG library). It works quite well up to some large values of q. The problem I face is that there is always some constant normalising factor to account for, and I don't get why. This normalising factor evolves with the number N of points used in the mesh. It has the form of a power law: normalising factor F = N^(-0.5) x exp(9.9), approximately (see figure, where the black line is the "exact" Fourier transform and the green, magenta, blue, red and yellow lines are the FFT calculated for different values of N).
EDIT 3: I found the factor to be A*N^(-0.5), where A is the length of the integration mesh.
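For what it's worth, here is a small sketch (in Python rather than Fortran, with scipy's DST standing in for the NAG FFT routine) of the sine-transform formulation F(q) = (4π/q) ∫ r f(r) sin(qr) dr on a uniform mesh. The discretised integral supplies a factor Δr = R/(N+1) that a bare FFT/DST does not apply for you; if the library routine additionally scales its output by 1/√N, the leftover correction is Δr·√N ≈ A·N^(-0.5), which would be consistent with the factor reported in EDIT 3.

import numpy as np
from scipy.fft import dst

# Test function f(r) = exp(-r); its 3D radial transform is known: F(q) = 8*pi / (1 + q^2)^2
R, N = 60.0, 2**14                    # mesh length and number of interior points
dr = R / (N + 1)
r = dr * np.arange(1, N + 1)
f = np.exp(-r)

# DST-I computes 2 * sum_n x[n] * sin(pi*(k+1)*(n+1)/(N+1)) = 2 * sum_n x[n] * sin(q_k * r_n)
# with q_k = pi*(k+1)/R, so the Riemann sum for the integral is (dr/2) * dst(r*f)[k]
q = np.pi * np.arange(1, N + 1) / R
F = (4.0 * np.pi / q) * (dr / 2.0) * dst(r * f, type=1)

F_exact = 8.0 * np.pi / (1.0 + q**2) ** 2
print(np.max(np.abs(F[:500] - F_exact[:500])))   # small error for moderate q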

Estimating/Choosing optimal Hyperparameters for DBSCAN

I need to find naturally occurring classes of nouns based on their distribution with different prepositions (like agentive, instrumental, time, place etc.). I tried using k-means clustering, but it was of little help: it didn't work well, and there was a lot of overlap between the classes I was looking for (probably because of the non-globular shape of the classes and the random initialisation in k-means).
I am now working with DBSCAN, but I have trouble understanding the epsilon value and the minimum-points (minPts) value of this clustering algorithm. Can I use random values, or do I need to compute them? Can anybody help? Particularly with epsilon: at least, how do I compute it if I need to?
Use your domain knowledge to choose the parameters. Epsilon is a radius; minPts can be thought of as a minimum cluster size.
Obviously random values won't work very well. As a heuristic, you can try to look at a k-distance plot; but it's not automatic either.
The first thing to do either way is to choose a good distance function for your data. And perform appropriate normalization.
As for minPts, it again depends on your data and needs. One user may want a very different value than another. And of course minPts and epsilon are coupled: if you double epsilon, you will roughly need to increase minPts by a factor of 2^d (for Euclidean distance, because that is how the volume of a hypersphere increases!).
If you want lots of small, finely detailed clusters, choose a low minPts. If you want larger and fewer clusters (and more noise), use a larger minPts. If you don't want any clusters at all, choose minPts larger than your data set size...
It is highly important to select the hyperparameters of the DBSCAN algorithm correctly for your dataset and the domain it belongs to.
eps hyperparameter
In order to determine the best value of eps for your dataset, use the K-Nearest Neighbours approach as explained in these two papers: Sander et al. 1998 and Schubert et al. 2017 (both papers from the original DBSCAN authors).
Here's a condensed version of their approach:
If you have N-dimensional data to begin with, then choose n_neighbors in sklearn.neighbors.NearestNeighbors to be equal to 2xN - 1, and find the distances of the K-nearest neighbours (K being 2xN - 1) of each point in your dataset. Sort these distances and plot them to find the "elbow" that separates noisy points (with high K-nearest-neighbour distance) from points that will most likely fall into a cluster (with relatively low K-nearest-neighbour distance). The distance at which this "elbow" occurs is your optimal value of eps.
Here's some python code to illustrate how to do this:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def get_kdist_plot(X=None, k=None, radius_nbrs=1.0):
    nbrs = NearestNeighbors(n_neighbors=k, radius=radius_nbrs).fit(X)
    # For each point, compute distances to its k-nearest neighbors
    distances, indices = nbrs.kneighbors(X)
    distances = np.sort(distances, axis=0)
    distances = distances[:, k-1]
    # Plot the sorted K-nearest neighbor distance for each point in the dataset
    plt.figure(figsize=(8, 8))
    plt.plot(distances)
    plt.xlabel('Points/Objects in the dataset', fontsize=12)
    plt.ylabel('Sorted {}-nearest neighbor distance'.format(k), fontsize=12)
    plt.grid(True, linestyle="--", color='black', alpha=0.4)
    plt.show()
    plt.close()

# X: your (n_samples, n_features) data array
k = 2 * X.shape[-1] - 1  # k = 2*{dim(dataset)} - 1
get_kdist_plot(X=X, k=k)
Here's an example resultant plot from the code above:
From the plot above, it can be inferred that the optimal value for eps can be assumed at around 22 for the given dataset.
NOTE: I would strongly advise the reader to refer to the two papers cited above (especially Schubert et al. 2017) for additional tips on how to avoid several common pitfalls when using DBSCAN as well as other clustering algorithms.
There are a few articles online –– DBSCAN Python Example: The Optimal Value For Epsilon (EPS) and CoronaVirus Pandemic and Google Mobility Trend EDA –– which basically use the same approach but fail to mention the crucial choice of the value of K or n_neighbors as 2xN-1 when performing the above procedure.
min_samples hyperparameter
As for the min_samples hyperparameter, I agree with the suggestions in the accepted answer. Also, a general guideline for choosing this hyperparameter's optimal value is that it should be set to twice the number of features (Sander et al. 1998). For instance, if each point in the dataset has 10 features, a starting point to consider for min_samples would be 20.
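Putting the two hyperparameters together, a minimal sketch (eps read off the k-distance plot, roughly 22 in the example above; min_samples set to twice the number of features; X is the same dataset as in the earlier snippet):

from collections import Counter
from sklearn.cluster import DBSCAN

eps = 22.0                       # value at the "elbow" of the k-distance plot
min_samples = 2 * X.shape[-1]    # rule of thumb: twice the number of features

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
print(Counter(labels))           # label -1 counts the points treated as noise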

Minimizing Sum of Distances: Optimization Problem

The actual question goes like this:
McDonald's is planning to open a number of joints (say n) along a straight highway. These joints require warehouses to store their food. A warehouse can store food for any number of joints, but has to be located at one of the joints only. McD has a limited number of warehouses (say k) available, and wants to place them in such a way that the average distance of joints from their nearest warehouse is minimized.
Given an array (n elements) of coordinates of the joints and an integer 'k', return an array of 'k' elements giving the coordinates of the optimal positioning of warehouses.
Sorry, I don't have any examples available since I'm writing this down from memory. Anyway, one sample could be:
array={1,3,4,5,7,7,8,10,11} (n=9)
k=1
Ans: {7}
This is what I've been thinking: for k=1, we can simply find the median of the set, which gives the optimal location of the warehouse. However, for k>1, the given set should be divided into 'k' subsets (disjoint, and made of contiguous elements of the superset), and the median of each subset would give the warehouse locations. What I don't understand is on what basis the 'k' subsets should be formed. Thanks in advance.
EDIT: There's a variation to this problem also: Instead of sum/avg, minimize the maximum distance between a joint and its closest warehouse. I don't get this either..
The straight highway makes this an exercise in dynamic programming, working from left to right along the highway. A partial solution can be described by the location of the rightmost warehouse and the number of warehouses placed. The cost of the partial solution will be the total distance to the nearest warehouse (for fixed k, minimising this is the same as minimising the average) or, for the variant, the maximum distance so far to the closest warehouse.
At each stage you have worked out the answers for the leftmost N joints, indexed by the number of warehouses used and the position of the rightmost warehouse; you need to save only the best cost for each such pair. Now consider the next joint and work out the best solution for N+1 joints and all possible values of k and rightmost warehouse, using the answers you have stored for N joints to speed this up. Once you have worked out the best-cost solution covering all the joints, you know where its rightmost warehouse is, which gives you the location of one warehouse. Go back to the partial solution that this one was built from and find out where its rightmost warehouse is. That gives you one more warehouse location, and so you can work your way back to the locations of all the warehouses in the best solution.
I tend to get the cost of working this out wrong, but with N joints and k warehouses to place you have N steps to take, each based on considering no more than Nk previously stored solutions, so I reckon the cost is about kN^2.
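Here is a sketch of that dynamic programme in Python (illustrative; the O(k·N^2) cost matches the estimate above). It relies on the fact, mentioned in the question, that the best single warehouse for a contiguous group of joints sits at the group's median:

def k_median_on_line(xs, k):
    """Place k warehouses at joint coordinates minimising the total distance
    of the joints to their nearest warehouse (1-D k-median)."""
    xs = sorted(xs)
    n = len(xs)
    # prefix[i] = xs[0] + ... + xs[i-1], for O(1) range-sum queries
    prefix = [0] * (n + 1)
    for i, x in enumerate(xs):
        prefix[i + 1] = prefix[i] + x

    def cost(i, j):
        """Total distance of joints xs[i..j] to a single warehouse at their median."""
        m = (i + j) // 2
        left = xs[m] * (m - i) - (prefix[m] - prefix[i])
        right = (prefix[j + 1] - prefix[m + 1]) - xs[m] * (j - m)
        return left + right

    INF = float("inf")
    # dp[t][j] = best cost for the first j joints using t warehouses
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    choice = [[-1] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0
    for t in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(t - 1, j):                 # joints i..j-1 share the t-th warehouse
                c = dp[t - 1][i] + cost(i, j - 1)
                if c < dp[t][j]:
                    dp[t][j] = c
                    choice[t][j] = i
    # Walk the stored choices backwards to recover the warehouse locations
    warehouses, j = [], n
    for t in range(k, 0, -1):
        i = choice[t][j]
        warehouses.append(xs[(i + j - 1) // 2])       # median of xs[i..j-1]
        j = i
    return sorted(warehouses), dp[k][n]

print(k_median_on_line([1, 3, 4, 5, 7, 7, 8, 10, 11], 1))   # -> ([7], 23)

For k=1 on the sample array this returns {7} with a total distance of 23, matching the expected answer in the question.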
This is NOT a clustering problem; it's a special case of a facility location problem. You can solve it using a general integer/linear programming package, but because the problem is on a line there may be more efficient (and less expensive, software-wise) algorithms that would work. You might consider dynamic programming, since there are probably combinations of facilities that could be eliminated rather quickly. Look into the P-Median problem for more info.
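For reference, the p-median formulation mentioned above can be written down directly with an off-the-shelf MILP package. A sketch using PuLP (purely illustrative, and overkill for the 1-D case, where the dynamic programme above is far cheaper):

import pulp

def p_median(points, k):
    """Open k facilities at point locations so that the total distance from
    each point to its nearest open facility is minimal."""
    n = len(points)
    d = [[abs(points[i] - points[j]) for j in range(n)] for i in range(n)]

    prob = pulp.LpProblem("p_median", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("assign", (range(n), range(n)), cat="Binary")
    y = pulp.LpVariable.dicts("open", range(n), cat="Binary")

    prob += pulp.lpSum(d[i][j] * x[i][j] for i in range(n) for j in range(n))
    for i in range(n):
        prob += pulp.lpSum(x[i][j] for j in range(n)) == 1     # each joint assigned once
        for j in range(n):
            prob += x[i][j] <= y[j]                            # only to open warehouses
    prob += pulp.lpSum(y[j] for j in range(n)) == k            # exactly k warehouses

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return sorted({points[j] for j in range(n) if y[j].value() == 1})

print(p_median([1, 3, 4, 5, 7, 7, 8, 10, 11], 1))   # -> [7]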