Difference Between Two Clusters - data-mining

So, I have a task to do, but I need advice on how to do it. My data points are: 1, 2, 9, 6, 4 and I need to compute the distance between clusters. I need to use Euclidean distance.
My answer was: {1,1} = 0, {1,2} = 1, {1,9} = 8. Am I doing this correctly or not?

So you have 5 data points, right?
The formulas should be:
square root of ((1-1)²) = 0
square root of ((1-2)²) = 1
square root of ((1-9)²) = 8
...so yeah, you're right.
Euclidean distance formula
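For reference, the same pairwise distances can be checked quickly in Python; a minimal sketch using the points from the question (in one dimension the Euclidean distance reduces to the absolute difference):
from itertools import combinations
from math import sqrt

points = [1, 2, 9, 6, 4]
for a, b in combinations(points, 2):
    print(a, b, sqrt((a - b) ** 2))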


"ROUND" calculation can't be lower than 1.0

I need a ROUND calculation to always round up when the value lands between 0 and 1 (but not when the value is above that range), but I can't seem to figure out how to make it work.
This is what I have currently:
=ROUND(100/DATA!H6)
Try:
=IF((100/DATA!H6>0)*(100/DATA!H6<1), ROUNDUP(100/DATA!H6, 0), 100/DATA!H6)
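The IF condition multiplies the two comparisons, so it is true only when the division result lies strictly between 0 and 1; only then is the value rounded up. The same logic sketched in Python (a hypothetical helper, not part of the spreadsheet):
import math

def round_up_small(value):
    # Round up only when the value falls strictly between 0 and 1.
    if 0 < value < 1:
        return math.ceil(value)
    return value

print(round_up_small(0.4))  # 1
print(round_up_small(3.7))  # 3.7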

Adding values from multiple .rrd file

Problem:
Basically there are three .rrd files which are generated for three departments.
From those we fetch three values (MIN, MAX, CURRENT) and print them in a 3x3 format. There is a Python script which does that.
eg -
Dept1: Min=10 Max=20 Cur=15
Dept2: Min=0 Max=10 Cur=5
Dept3: Min=10 Max=30 Cur=25
Now I want to add the values together (Min, Max, Cur) and print in one line.
eg -
Dept: Min=20 Max=60 Cur=45
Issue I am facing:
No matter what CDEF I write, I am breaking the graph. :(
This is the part I hate, as I do not get any error message.
As far as I understand (please correct me if I am wrong), I definitely cannot store the value anywhere in my program, as a graph is returned.
What would be a proper way to add the values in this situation?
Please let me know if my description of the problem is lacking in detail.
You can do this with a VDEF over a CDEF'd sum.
DEF:a=dept1.rrd:ds0:AVERAGE
DEF:b=dept2.rrd:ds0:AVERAGE
DEF:maxa=dept1.rrd:ds0:MAXIMUM
DEF:maxb=dept2.rrd:ds0:MAXIMUM
CDEF:maxall=maxa,maxb,+
CDEF:all=a,b,+
VDEF:maxalltime=maxall,MAXIMUM
VDEF:alltimeavg=all,AVERAGE
PRINT:maxalltime:Max=%f
PRINT:alltimeavg:Avg=%f
LINE:all#ff0000:AllDepartments
However, you should note that, apart from at the highest granularity, the Min and Max totals will be wrong! This is because max(a+b) != max(a) + max(b). If you don't calculate the min/max aggregate at time of storage, the granularity will be gone at time of display.
For example, if a = (1, 2, 3) and b = (3, 2, 1), then max(a) + max(b) = 6; however the maximum at any point in time is in fact 4. The same issue applies to using min(a) + min(b).
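A quick numerical check of that point, using the same lists (a minimal Python sketch):
a = [1, 2, 3]
b = [3, 2, 1]
print(max(a) + max(b))                   # 6
print(max(x + y for x, y in zip(a, b)))  # 4, the true maximum of the summed series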

When a draw occurs while tracking the most occurrences in a list, how do I find the element with the highest index?

lines = ["Pizza", "Vanilla","Los Angeles Pikes","Cookie Washington Tennis Festival","Water Fiesta","Watermelon"]
best= max(set(lines), key=lines.count)
print (best)
The code above returns the element with the most occurrences in the list, but in case of a draw, I want it to return the element with the greatest index. So here I want Watermelon to be printed, and if anything is added without breaking the tie, the tied element with the highest index should be printed.
I need a solution with simple, basic code like that seen above and without importing libraries. If you could help find a good solution for this, it would be really helpful.
You could add the index, normalized by a value greater than the length of the list, to the result of count. The normalized index is always less than 1.0, so it does not affect the primary comparison on counts, but it guarantees that there are no ties. I would use a small function to do this:
lines = ["Pizza", "Vanilla", "Los Angeles Pikes",
"Cookie Washington Tennis Festival",
"Water Fiesta", "Watermelon"]
def key(x):
return lines.count(x) + lines.index(x) / (len(lines) + 1)
best = max(set(lines), key=key)
print(best)
While your original code returned "Los Angeles Pikes" in my version of Python (because of the way the set hashing turned out), the new version returns "Watermelon", as expected.
You can also use a lambda, but I find that a bit harder to read:
best = max(set(lines), key=lambda x: lines.count(x) + lines.index(x) / (len(lines) + 1))

igraph invalid vertex Id

I'm trying to run igraph's fast greedy community detection algorithm using the following code:
G = Graph()
L = []
V = []
for row in cr:
    try:
        l = []
        source = int((row[0]).strip())
        target = int((row[1]).strip())
        weight = int((row[2]).strip())
        l.append(source)
        l.append(target)
        if l not in L:
            L.append(l)
        if source not in V:
            V.append(source)
        if target not in V:
            V.append(target)
    except ValueError:
        print "Value Error"
        continue
    if weight == 1:
        continue
G.add_vertices(max(V))
G.add_edges(L)
cl = G.community_fastgreedy(weights=weight).as_clustering(10);
But this is the error I'm getting:
igraph._igraph.InternalError: Error at type_indexededgelist.c:272: cannot add edges, Invalid vertex id
I found this: Cannot add edges, Invalid vertex ID in IGraph so I tried adding all the vertices and then all the edges but I still get an error.
Does the above code do the same thing as:
tupleMapping = []
for row in cr:
    if int(row[2]) < 10:
        continue
    l = [row[0], row[1], row[2]]
    tupleMapping.append(tuple(l))
g = Graph.TupleList(tupleMapping)
cl = g.community_fastgreedy().as_clustering(20)
I don't have to explicitly say G.community_fastgreedy(weights=weight), right?
Another problem I was having: when I try to create more clusters in the following way:
cl = g.community_fastgreedy().as_clustering(10)
cl = g.community_fastgreedy().as_clustering(20)
I get two large clusters and the rest of the clusters consist of one element each. This happens whether I set the number of clusters to 5, 10 or 20. Is there any way for me to make the clusters more equally divided? I need more than 2 clusters for my dataset.
This is a small snippet of the data I'm trying to read from the csv file so that I can generate a graph and then run the community detection algorithm:
202,580,11
87,153,7
227,459,6
263,524,11
Thanks.
That's right, the second piece of code does the same. In the first example, the problem is that when you add edges, you refer to igraph's internal vertex IDs, which always start from 0 and go up to N-1. It does not matter that your own vertex names are integers; you need to translate them to igraph vertex IDs.
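For the first approach, a minimal sketch of that translation, reusing the V and L lists built in the question's code (an illustration, not tested against the asker's data):
names = sorted(V)
name2vid = dict((name, vid) for vid, name in enumerate(names))
G = Graph()
G.add_vertices(len(names))
G.vs['name'] = names
G.add_edges([(name2vid[s], name2vid[t]) for s, t in L])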
The igraph.Graph.TupleList() method is much more convenient here. However, you need to specify that the third element of the tuple is the weight. You can do it with either the weights = True or the edge_attrs = ['weight'] argument:
import igraph

data = '''1;2;34
1;3;41
1;4;87
2;4;12
4;5;22
5;6;33'''

L = set([])
for row in data.split('\n'):
    row = row.split(';')
    L.add(
        (row[0].strip(), row[1].strip(), int(row[2].strip()))
    )

G = igraph.Graph.TupleList(L, edge_attrs = ['weight'])
You can then create dictionaries to translate between igraph vertex IDs and your original names:
vid2name = dict(zip(xrange(G.vcount()), G.vs['name']))
name2vid = dict((name, vid) for vid, name in vid2name.iteritems())
However, the first is not so much needed, as you can always use G.vs[vid]['name'].
For fastgreedy, I think you should specify the weights; at least the documentation does not say whether it automatically uses the attribute named weight if such an attribute exists.
fg = G.community_fastgreedy(weights = 'weight')
fg_clust_10 = fg.as_clustering(10)
fg_clust_20 = fg.as_clustering(20)
If fastgreedy gives you only 2 large clusters, I can only recommend trying other community detection methods. You could try all of those that run within a reasonable time (it depends on the size of your graph) and then compare their results. Also, because you have a weighted graph, you could take a look at the ModuLand method family, which is not implemented in igraph, but has good documentation, and lets you use quite sophisticated settings.
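As a sketch of such a comparison on the small weighted graph G from above (these are standard igraph methods, but check what is available in your version; on a real graph you can pass the desired number of clusters to as_clustering()):
fg = G.community_fastgreedy(weights='weight').as_clustering()
ml = G.community_multilevel(weights='weight')
wt = G.community_walktrap(weights='weight').as_clustering()
for label, cl in [('fastgreedy', fg), ('multilevel', ml), ('walktrap', wt)]:
    print('%s cluster sizes: %s' % (label, sorted(cl.sizes(), reverse=True)))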
Edit: The comments from the OP suggest that the original data describe a directed graph. The fastgreedy algorithm is unable to consider directions and gives an error if called on a directed graph. That's why in my example I created an undirected igraph.Graph() object. If you want to run other methods, some of which might be able to deal with directed networks, you should first create a directed graph:
G = igraph.Graph.TupleList(L, directed = True, edge_attrs = ['weight'])
G.is_directed()
# returns True
To run fastgreedy, convert the graph to undirected. As you have a weight attribute for the edges, you need to specify what igraph should do when two edges of opposite direction between the same pair of vertices are collapsed into one undirected edge. You can do many things with the weights, like taking the mean, the larger, or the smaller one, etc. For example, to make the combined edges have the mean weight of the original edges:
uG = G.as_undirected(combine_edges = 'mean')
fg = uG.community_fastgreedy(weights = 'weight')
Important: be aware that during this operation, and also when you add or remove vertices or edges, igraph reindexes the vertices and edges, so if you know that vertex id x corresponds to your original id y, this will no longer be valid after reindexing; you need to recreate the name2vid and vid2name dictionaries.
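A small helper along these lines can rebuild the mappings after any such operation (a sketch; it only assumes the 'name' vertex attribute that TupleList sets):
def rebuild_maps(graph):
    vid2name = dict(enumerate(graph.vs['name']))
    name2vid = dict((name, vid) for vid, name in vid2name.items())
    return vid2name, name2vid

vid2name, name2vid = rebuild_maps(uG)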

Generating rolling z-scores of panel data in Stata

I have an unbalanced panel data set (countries and years). For simplicity let's say I have one variable, x, that I am measuring. The panel data are sorted first by country (a 3-digit numeric country code) and then by year. I would like to write a .do file that generates a new variable, z_x, containing the standardized values of the variable x. The values should be standardized by subtracting the mean of the preceding (exclusive) m time periods and then dividing by the standard deviation of those same time periods. If this is not possible, a missing value should be returned.
Currently, the code I am using to accomplish this is the following (edited now for clarity):
xtset weocountrycode year
sort weocountrycode year
local win_len = 5 // Defining rolling window length.
quietly: rolling sd_x=r(sd) mean_x=r(mean), window(`win_len') saving(stats_x, replace): sum x
use stats_x, clear
rename end year
save, replace
use all_data_PROCESSED_FINAL.dta, clear
quietly: merge 1:1 (weocountrycode year) using stats_x
replace sd_x = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n] // This and next line are for deleting values that rolling calculates when I actually want missing values.
replace mean_`x' = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n]
gen z_`x' = (`x' - mean_`x'[_n-1])/sd_`x'[_n-1] // calculate z-score
UPDATE:
My struggle with rolling is that when rolling is set up to use a window length 5 rolling mean, it automatically does window length 1, 2, 3 and 4 means for the first, second, third and fourth entries (when there are not 5 preceding entries available to average over). In fact, it does this in general: if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, a length 2 rolling average on entry 6, ... and then finally start doing length 5 moving averages on entry 9. My issue is that I do not want this, so I would like to avoid performing these calculations. Until now, I have only been able to figure out how to delete them after they are done, which is both inefficient and bothersome.
I tried adding an if clause to the -rolling- statement:
quietly: rolling sd_x=r(sd) mean_x=r(mean) if x[_n-`win_len'+1] != . & weocountrycode[_n-`win_len'+1] != weocountrycode[_n], window(`win_len') saving(stats_x, replace): sum x
But it did not fix the problem and the output is "weird" in the sense that
1) If `win_len' is equal to, say, 10, there are 15 missing values in the resulting z_x variable, instead of 9.
2) Even though there are "extra" missing values in z_x, the observations still start out as window length 1 means, then window length 2 means, etc., which makes no sense to me.
This leads me to believe I fundamentally don't understand 1) what -rolling- is doing and 2) how an if clause works in the context of -rolling-.
Does this help?
Thanks!
I'm not sure I understand completely, but I'll try to answer based on what I think your problem is, and based on a comment by Nick Cox.
You say:
... when rolling is set up to use a window length 5 rolling mean...
... if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ...
This is expected. help rolling states:
The window size refers to calendar periods, not the number of observations. If there are missing data (for example, because of weekends), the actual number of observations used by command may be less than window(#).
It's not actually doing a "length 1 rolling average", but I'll get to that later.
Below are some examples to see what rolling does:
clear all
set more off
*-------------------------- example data -----------------------------
set obs 92
gen dat = _n - 1
format dat %tq
egen seq = fill(1 1 1 1 2 2 2 2)
tsset dat
tempfile main
save "`main'"
list in 1/12, separator(4)
*------------------- Example 1. None missing ------------------------
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------- Example 2. All but one value, missing in first window ------
use "`main'", clear
replace seq = . in 1/3
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------------- Example 3. All missing in first window --------------
use "`main'", clear
replace seq = . in 1/4
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
Note that I use the stepsize option to make things much easier to follow. Because the date variable is in quarters, I set window(4) and stepsize(4) so rolling is just computing averages by year. I hope that's easy to see.
Example 1 does as expected. No problem here.
Example 2, on the other hand, should be more interesting for you. We've said that what matters are calendar periods, so the mean is computed for the whole year (four quarters), even though it contains missings. There are three missings and one non-missing value. summarize is computing the mean over the whole year, but summarize ignores missings, so it just outputs the mean of the non-missings, which in this case is just one value.
Example 3 has missings for all four quarters of the year. Therefore, summarize outputs . (missing).
Your problem, as I understand it, is that when you face a situation like Example 2, you'd like the output to be missing. This is where I think Nick Cox's advice comes in. You could try something like:
rolling mean=r(mean) N=r(N), window(4) stepsize(4) clear: summarize seq, detail
replace mean = . if N != 4
list in 1/12, separator(0)
This says: if the number of non-missings in the window (r(N), also computed by summarize) is not the same as the window size, then replace the mean with missing.
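Putting the pieces together, the original goal (a z-score against the preceding, exclusive m periods, set to missing when fewer than m prior observations exist) can also be sketched outside Stata; this is an illustration in Python/pandas, not part of the Stata answer, and the toy data are made up:
import pandas as pd

# Toy unbalanced panel; the real data would come from the asker's .dta file.
df = pd.DataFrame({
    'weocountrycode': [111, 111, 111, 111, 111, 112, 112, 112],
    'year':           [2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002],
    'x':              [1.0, 2.0, 9.0, 6.0, 4.0, 3.0, 5.0, 7.0],
}).sort_values(['weocountrycode', 'year'])

m = 3  # rolling window length (the `win_len' of the question)

def rolling_z(s):
    prior = s.shift(1)                             # exclude the current period
    mean = prior.rolling(m, min_periods=m).mean()  # NaN unless m prior values exist
    sd = prior.rolling(m, min_periods=m).std()
    return (s - mean) / sd

df['z_x'] = df.groupby('weocountrycode', group_keys=False)['x'].apply(rolling_z)
print(df)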