Matching dendrogram with cluster number in Python's scipy.cluster.hierarchy - python-2.7

The following code generates a simple hierarchical cluster dendrogram with 10 leaf nodes:
import numpy as np
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt
X = np.random.randn(10, 2)
d = sch.distance.pdist(X)
Z = sch.linkage(d, method='complete')
P = sch.dendrogram(Z)
plt.show()
I generate three flat clusters like so:
T = sch.fcluster(Z, 3, 'maxclust')
# array([3, 1, 1, 2, 2, 2, 2, 2, 1, 2])
However, I'd like to see the cluster labels 1,2,3 on the dendrogram. It's easy for me to visualize with just 10 leaf nodes and three clusters, but when I have 1000 nodes and 10 clusters, I can't see what's going on.
How do I show the cluster numbers on the dendrogram? I'm open to other packages. Thanks.

Here is a solution that colors each cluster and labels the leaves of the dendrogram with their cluster number (leaves are labeled 'point number,cluster number'). The two techniques can be used independently or together. I modified your original example to include both:
import numpy as np
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt
n = 10
k = 3
X = np.random.randn(n, 2)
d = sch.distance.pdist(X)
Z = sch.linkage(d, method='complete')
T = sch.fcluster(Z, k, 'maxclust')
# calculate labels of the form 'point number,cluster number'
labels = [str(i) + ',' + str(T[i]) for i in range(n)]
# calculate color threshold: the height of the merge that reduces k clusters to k-1
ct = Z[-(k-1), 2]
# plot
P = sch.dendrogram(Z, labels=labels, color_threshold=ct)
plt.show()
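With 1000 leaves the per-leaf labels become unreadable, so one option is to write a single cluster number under each colored block instead. The following is a minimal sketch of that idea, not part of the original answer: it assumes the leaves are drawn at x = 5, 15, 25, ... (the spacing SciPy's dendrogram uses), and the sizes `n = 200` and `k = 5` are arbitrary illustration values.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, k = 200, 5  # illustration sizes, chosen arbitrarily
X = rng.standard_normal((n, 2))
Z = sch.linkage(pdist(X), method="complete")
T = sch.fcluster(Z, k, "maxclust")

fig, ax = plt.subplots(figsize=(12, 4))
ct = Z[-(k - 1), 2]  # same color threshold as in the answer above
R = sch.dendrogram(Z, no_labels=True, color_threshold=ct, ax=ax)

# Cluster number of each leaf, in left-to-right plotting order
leaf_clusters = T[R["leaves"]]

# Put one label under the middle of each cluster's block of leaves.
# Assumption: leaf i (in plotting order) sits at x = 10 * i + 5.
centers = {}
for c in np.unique(leaf_clusters):
    positions = np.flatnonzero(leaf_clusters == c)
    centers[c] = 10 * positions.mean() + 5
    ax.text(centers[c], -0.02, str(c), ha="center", va="top",
            transform=ax.get_xaxis_transform())
```

Each flat cluster is a subtree, so its leaves sit next to each other in the plot and one label per block is enough. Call `plt.show()` or `fig.savefig(...)` to view the result.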

Related

How to plot graph from file using Python, problem of the junction of lines

I'm new to python and have a question. I have a file.csv file that contains two columns.
FILE.csv
0.0000 9.0655
0.0048 9.0640
0.0096 9.0592
0.0144 9.0510
0.0192 9.0392
0.0240 9.0233
0.0288 9.0028
0.0336 8.9770
0.0384 8.9451
0.0432 8.9063
0.0480 8.8595
0.0528 8.8039
0.0576 8.7385
0.0624 8.6626
0.0000 11.0013
0.0048 11.0018
0.0096 11.0032
0.0144 11.0057
0.0192 11.0091
0.0240 11.0134
0.0288 11.0186
0.0336 11.0247
0.0384 11.0317
0.0432 11.0394
0.0480 11.0478
0.0528 11.0569
0.0576 11.0666
0.0624 11.0767
0.0672 11.0873
I tried to plot the graph from FILE.csv with xmgrace and Gnuplot, and the result is very convincing: I get two separate lines in the graph.
[two figures, not preserved]
On the other hand, if I use my Python script, the two lines are joined. Here is my script:
import matplotlib.pyplot as plt
import pylab as plt
#
with open('bb.gnu') as f:
    f=[x.strip() for x in f if x.strip()]
    data=[tuple(map(float,x.split())) for x in f[2:]]
    BX1=[x[0] for x in data]
    BY1=[x[1] for x in data]
plt.figure(figsize=(8,6))
ax = plt.subplot(111)
plt.plot(BX1, BY1, 'k-', linewidth=2 ,label='Dos')
plt.plot()
plt.savefig("Fig.png", dpi=100)
plt.show()
And here's the result:
[figure, not preserved]
My question: is there a way to plot this graph with Python without the junction between the two lines, so that the result looks like the one from Gnuplot and xmgrace?
Thank you in advance for your help.
To my knowledge, matplotlib is only joining your two curves because you provide them as one set of data. This means that you need to call plot twice in order to generate two curves. I put your data in a file called data.csv and wrote the following piece of code:
import numpy
import matplotlib.pyplot as plt
data = numpy.genfromtxt('data.csv')
starts = numpy.asarray(data[:, 0] == 0).nonzero()[0]
fig, ax = plt.subplots(nrows=1, ncols=1, num=0, figsize=(16, 8))
for i in range(starts.shape[0]):
    if i == starts.shape[0] - 1:
        ax.plot(data[starts[i]:, 0], data[starts[i]:, 1])
    else:
        ax.plot(data[starts[i]:starts[i + 1], 0],
                data[starts[i]:starts[i + 1], 1])
plt.show()
which generates a figure with two separate curves (image not preserved).
What I do with starts is look for the rows of data whose first column contains the value 0, which I consider to be the start of a new curve. The loop then draws one curve per iteration. The if statement distinguishes the last curve, which runs to the end of the data, from the others. There is probably a more elegant way, but it works.
Also, do not import pylab; it is discouraged because it needlessly fills the namespace.
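A more compact variant of the same idea is to let `numpy.split` cut the array into curves. This is only a sketch under a slightly different assumption than the answer's: a new curve is taken to start wherever x decreases, rather than wherever x equals 0. The small inline array stands in for the file contents.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt

# Stand-in for numpy.genfromtxt('data.csv'): a few rows from each block of the file
data = np.array([
    [0.0000, 9.0655],
    [0.0048, 9.0640],
    [0.0096, 9.0592],
    [0.0000, 11.0013],
    [0.0048, 11.0018],
    [0.0096, 11.0032],
    [0.0144, 11.0057],
])

# Assumption: a curve ends where the x value drops instead of increasing
breaks = np.flatnonzero(np.diff(data[:, 0]) < 0) + 1
curves = np.split(data, breaks)  # list of (rows, 2) arrays, one per curve

fig, ax = plt.subplots(figsize=(8, 6))
for curve in curves:
    ax.plot(curve[:, 0], curve[:, 1], linewidth=2)
```

Because every curve is drawn by its own `plot` call, no segment connects the end of one curve to the start of the next.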

PySpark Using collect_list to collect Arrays of Varying Length

I am attempting to use collect_list to collect arrays (and maintain order) from two different data frames.
Test_Data and Train_Data have the same format.
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy('Group').orderBy('date')
# Train_Data has 4 data points
# Test_Data has 7 data points
# desired target array: [1, 1, 2, 3]
# desired MarchMadInd array: [0, 0, 0, 1, 0, 0, 1]
sorted_list_diff_array_lens = train_data.withColumn('target',
F.collect_list('target').over(w)
)\
test_data.withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))\
.groupBy('Group')\
.agg(F.max('target').alias('target'),
F.max('MarchMadInd').alias('MarchMadInd')
)
I realize the syntax is incorrect with "test_data.withColumn", but I want to select the array for MarchMadInd from test_data and the array for target from train_data. The desired output would look like the following:
{"target":[1, 1, 2, 3], "MarchMadInd":[0, 0, 0, 1, 0, 0, 1]}
Context: this is for a DeepAR time series model (using AWS) that requires dynamic features to include the prediction period, but the target should be historical data.
The solution involves using a join, as recommended by pault:
1. Create a dataframe with the dynamic features, of length equal to the training + prediction period.
2. Create a dataframe with the target values, of length equal to just the training period.
3. Use a LEFT JOIN (with the dynamic feature data on the left) to bring these dataframes together.
Now, using collect_list will create the desired result, because collect_list skips the null target values that the join produces for the prediction period.

Integrating an array in scipy with bounds.

I am trying to integrate over an array of data, but with bounds. Therefore I planned to use simps (scipy.integrate.simps). Because simps itself does not support bounds, I decided to feed it only the selection of my data I want to integrate over. Yet this leads to strange results which are twice as big as the expected outcome.
What am I doing wrong, or what am I missing or misunderstanding?
# -*- coding: utf-8 -*-
from scipy import integrate
from scipy import interpolate
import numpy as np
import matplotlib.pyplot as plt
# my data
x = np.linspace(-10, 10, 30)
y = x**2
# but I only want to integrate from 3 to 5
f = interpolate.interp1d(x, y)
x_selection = np.linspace(3, 5, 10)
y_selection = f(x_selection)
# quad returns the expected result
print 'quad', integrate.quad(f, 3, 5), '<- the expected value (including error estimation)'
# but simps returns an unexpected result, when using the selected data
print 'simps', integrate.simps(x_selection, y_selection), '<- twice as big'
print 'trapz', integrate.trapz(x_selection, y_selection), '<- also twice as big'
plt.plot(x, y, marker='.')
plt.fill_between(x, y, 0, alpha=0.5)
plt.plot(x_selection, y_selection, marker='.')
plt.fill_between(x_selection, y_selection, 0, alpha=0.5)
plt.show()
Windows 7, Python 2.7, SciPy 1.0.0
The arguments to simps() and trapz() are in the wrong order.
You have flipped the calling arguments; simps and trapz expect first the y dimension, and second the x dimension, as per the docs. Once you have corrected this, similar results should obtain. Note that your example function admits a trivial analytic antiderivative, which would be much cheaper to evaluate.
– N. Wouda
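The factor of two has a simple explanation: with the arguments swapped, the routines integrate x with respect to y, and for y = x² that is ∫ x · 2x dx, exactly twice ∫ x² dx. The sketch below shows both call orders. It assumes a recent SciPy, where the functions are named `simpson` and `trapezoid` (the older `simps` and `trapz` aliases have been deprecated and removed in newer releases), and it samples x² directly instead of going through the interpolator.

```python
import numpy as np
from scipy import integrate

# Sample y = x**2 on [3, 5]; 11 points give an even number of intervals for Simpson's rule
x_selection = np.linspace(3, 5, 11)
y_selection = x_selection ** 2

exact = (5 ** 3 - 3 ** 3) / 3.0  # analytic value of the integral of x**2 from 3 to 5

# Correct order: y values first, x values passed as x=
simpson_ok = integrate.simpson(y_selection, x=x_selection)
trapezoid_ok = integrate.trapezoid(y_selection, x=x_selection)

# Swapped order, as in the question: this integrates x dy instead of y dx
trapezoid_swapped = integrate.trapezoid(x_selection, x=y_selection)

print("exact            ", exact)
print("simpson          ", simpson_ok)
print("trapezoid        ", trapezoid_ok)
print("trapezoid swapped", trapezoid_swapped)
```

Passing the sample points by keyword (`x=...`) makes it harder to swap the two arrays by accident.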

matplotlib: Plot multiple small figures in one big plot

I have a pandas dataframe pandas_df with 6 input columns (column_1, column_2, ..., column_6) and one result column, result. I used the following code to draw a scatter plot for each pair of input columns (6*5/2 = 15 figures in total). I ran the code 15 times, and each run generated one big figure.
Is there a way to iterate over all possible column pairs and plot all 15 figures as small figures in one big plot? Thanks!
%matplotlib notebook
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
pandas_df.plot(x='column_1', y='column_2', kind = 'scatter', c = 'result')
Consider the dataframe df
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 6), columns=pd.Series(list('123456')).radd('C'))
Solution
Use itertools and matplotlib.pyplot.subplots
from itertools import combinations
import matplotlib.pyplot as plt
pairs = list(combinations(df.columns, 2))
fig, axes = plt.subplots(len(pairs) // 3, 3, figsize=(15, 12))
for i, pair in enumerate(pairs):
    d = df[list(pair)]
    ax = axes[i // 3, i % 3]
    d.plot.scatter(*pair, ax=ax)
fig.tight_layout()
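The grid above works because 15 divides evenly by 3. As a hedged variation, the sketch below rounds the row count up for any column count, hides the unused axes, and colors the points by the result column as in the question. The random `pandas_df` and the choice of 4 columns per row are assumptions made purely for illustration.

```python
import math
from itertools import combinations

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt

# Hypothetical stand-in for the question's dataframe
rng = np.random.default_rng(0)
input_cols = ["column_%d" % i for i in range(1, 7)]
pandas_df = pd.DataFrame(rng.random((10, 6)), columns=input_cols)
pandas_df["result"] = rng.random(10)

pairs = list(combinations(input_cols, 2))
ncols = 4                                # arbitrary grid width
nrows = math.ceil(len(pairs) / ncols)    # round up so every pair gets an axes

fig, axes = plt.subplots(nrows, ncols, figsize=(16, 12))
flat_axes = axes.ravel()
for ax, (xcol, ycol) in zip(flat_axes, pairs):
    ax.scatter(pandas_df[xcol], pandas_df[ycol], c=pandas_df["result"])
    ax.set_xlabel(xcol)
    ax.set_ylabel(ycol)

# Hide the leftover axes when the grid has more cells than pairs
for ax in flat_axes[len(pairs):]:
    ax.set_visible(False)
fig.tight_layout()
```

With 6 input columns this gives a 4 x 4 grid with 15 visible scatter plots and one hidden cell.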

Plotting error bars from 2 axis

I'm looking to plot the standard deviation of some array data I've been working with in Python; however, the data is averaged over both longitude and latitude (axes 2 and 3 of my arrays).
What I have so far is a monthly plot (image not preserved), but I can't get the standard deviations to work.
I was wondering if anyone knows how to get around this problem. Here's the code I've used so far.
Any help is much appreciated!
# import things
import matplotlib.pyplot as plt
import numpy as np
import netCDF4
# [ date, hour, 0, lon, lat ]
temp = (f.variables['TEMP2'][:, 14:24, 0, :, :]) # temp at 2m
temp2 = (f.variables['TEMP2'][:, 0:14, 0, :, :])
# concatenate back to 24 hour period
tercon = np.concatenate((temp, temp2), axis=1)
ter1 = tercon.mean(axis=(2, 3))
rtemp = np.reshape(ter1, 672)-273
# X axis dates instead of times
date = np.arange(rtemp.shape[0]) # assume that delta time between data is 1
date21 = (date/24.) # use days instead of hours
# change plot size for monthly
plt.rcParams['figure.figsize'] = 15, 5
plt.plot(date21, rtemp , linestyle='-', linewidth=3.0, c='orange')
You should use errorbar instead of plot and pass it the precalculated standard deviations. The following adapted example uses random data to emulate your temperature data at hourly resolution, then computes the mean and the standard deviation for each day.
# import things
import matplotlib.pyplot as plt
import numpy as np
# x-axis: day-of-month
date21 = np.arange(1, 31)
# generate random "hourly" data (30 days x 24 hours, in chronological order)
hourly_temp = np.random.random(30*24)*10 + 20
# one row per day, one column per hour
daily_temp = hourly_temp.reshape(30, 24)
# mean "temperature" per day
daily_mean_temp = daily_temp.mean(axis=1)
# standard deviation per day
daily_std_temp = daily_temp.std(axis=1)
# create a figure
figure = plt.figure(figsize=(15, 5))
# add an axes to the figure
ax = figure.add_subplot(111)
ax.grid()
ax.errorbar(date21, daily_mean_temp, yerr=daily_std_temp, fmt="-o", capsize=15, capthick=3, linewidth=3.0, color='orange')
plt.show()
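To stay closer to the question's layout, the standard deviation can be taken over the same longitude and latitude axes as the mean. The sketch below is an assumption-laden illustration, not the answerer's code: the random array stands in for `tercon` with shape (days, hours, lon, lat), and a shaded band from `fill_between` is swapped in for `errorbar`, since 672 individual error bars would be crowded.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt

# Hypothetical stand-in for tercon: 28 days x 24 hours x 10 lon x 12 lat, in kelvin
rng = np.random.default_rng(0)
tercon = 283.0 + 5.0 * rng.standard_normal((28, 24, 10, 12))

# Spatial mean and spatial standard deviation for every hour
mean_c = tercon.mean(axis=(2, 3)).reshape(-1) - 273   # flattened to 672 hourly values
std = tercon.std(axis=(2, 3)).reshape(-1)              # an offset does not change the std

days = np.arange(mean_c.size) / 24.0                   # x axis in days instead of hours

fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(days, mean_c, color="orange", linewidth=1.5)
# Shaded band of one standard deviation around the mean
ax.fill_between(days, mean_c - std, mean_c + std, color="orange", alpha=0.3)
ax.set_xlabel("day")
ax.set_ylabel("temperature")
```

If discrete error bars are preferred, the same `mean_c` and `std` arrays can be passed to `ax.errorbar(days, mean_c, yerr=std)`.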