Data has the same value for every dimension after PCA

I encountered a bug(?) after performing PCA on a large dataset. I have about 2000 measurements and about 50 features / dimensions. I perform PCA to reduce the number of dimensions; I want to keep only 20-30 of them. But my data looks strange after I project it into the new PCA feature space: every dimension has the same value, except for the first. It doesn't matter how many dimensions I set for PCA, my data always looks like this (three dimensions as an example, four measurements):
10075.1;2.00177e-23;7.70922e-43
10114.6;2.00177e-23;7.70922e-43
10192.9;2.00177e-23;7.70922e-43
9843.2;2.00177e-23;7.70922e-43
What is the reason? Why do I have good data only for the first feature?
This is the original data:
0;24;54;167;19.3625;46;24;21;298.575;254.743;1.17207;1.73611;2.26757;18;15;14;12;9;8;4;15;13;12;9;8;4;33;28;26;21;17;15;8;0;0;1;92283.9;19441.8;16337;11731.8;6796.85;2215.39;1861.07;3516.91;4587.27;4130.99;7.38638;8;9.41167;10.5923;14;19.9733
0;24;54;167;19.3625;45;23;21;272.609;244.143;1.11659;1.89036;2.26757;17;15;14;11;9;7;4;16;13;12;9;8;4;33;28;26;20;17;14;8;0;1;1;92298.5;19414.8;16445.3;11871.4;6873.36;2071.48;1845.56;4483;4588.43;2854.95;7.06929;8;9.08176;10.0947;14;19.1412
0;24;54;167;19.3625;45;23;21;256.58;248.081;1.03426;1.89036;2.26757;17;15;14;11;9;7;4;15;13;12;9;8;4;32;28;26;20;17;14;8;0;1;1;92262.9;19449.6;16602.1;12066.9;6875.38;1762.22;1813.8;4461.31;4605.87;4540.53;6.72761;7;9.17784;10.0404;14;19.0638
0;24;54;167;19.3625;45;22;23;228.664;293.1;0.780157;2.06612;1.89036;16;14;13;10;8;7;4;17;14;13;10;8;3;33;28;26;20;16;13;7;1;0;0;92047.3;19594.2;16615.9;11855.3;6357.26;1412.1;1931.18;3292.93;4305.41;3125.78;7.14206;7;9.15515;10.0013;14;18.9998
Here are the eigenvalues and eigenvectors:
120544647.296627;
1055287.207309433;
788517.1814841435
4.445188101138883e-06, -1.582751359550716e-06, 0.0001194540407426801, 8.805619419232736e-05, 1.718812629108742e-05, -6.478627494871924e-06, 1.866065159173557e-06, -8.102268773738454e-06, 0.001575116366026065, 0.001368858662087531, 2.42338448583798e-06, 1.468791084230193e-07, 1.619495879919206e-08, 2.045676050284675e-06, 4.522426974955079e-06, 1.935642018365442e-06, 9.400348593348646e-07, 3.50785209102226e-06, -6.886458171608557e-07, -2.272864941126205e-06, -4.576437628645375e-06, -3.711985547436847e-06, -4.179746481364989e-06, -1.080958836802159e-06, 3.018347636693104e-06, -5.401065369031065e-08, -1.776343529071431e-06, -3.239711622030108e-06, 2.426893254220096e-06, 2.329701819532251e-06, -1.335049163771412e-06, -2.016447535744125e-06, -2.48848684914049e-06, 1.034821043317487e-06, 0.9509463574053698, 0.2040750414336948, 0.1698045366243798, 0.1221511665292666, 0.06648621927929886, 0.01787357780337607, 0.02181878649610538, 0.04094056949392437, 0.04589005034245261, 0.03602144595540402, 4.638015609510389e-05, -9.594011737623517e-07, 5.643329708389021e-05, 6.49999142971481e-05, 6.708699420903862e-07, 0.0001209291154324417;
-1.193874321738139e-05, -3.042062337012123e-05, -0.0001368023572559274, -0.0001093928140002418, -1.847065231448535e-05, 3.847106756849437e-05, -1.23803319528626e-05, 2.082402112096706e-06, -0.002107941678699949, -0.0007526438176676972, -1.304240623192574e-06, -4.358106348750469e-06, 4.189661461745327e-06, 3.972537960568455e-07, 5.415441896012467e-06, -3.487031299718403e-06, -3.082927770719131e-06, -6.180776247962886e-06, -3.293811231853141e-06, -3.069190535161948e-06, 9.242946297782889e-06, 1.849824602072292e-06, 8.007250998398399e-06, 9.597348504390614e-06, -7.976030386807306e-07, 1.465838819379542e-05, -1.637206697646072e-06, 4.924323227679534e-06, 3.416572256427778e-06, -4.091414270533951e-06, 3.950956777004832e-06, -1.425709512894606e-05, -1.612907157276045e-06, -1.656147283798045e-06, 0.01791626179130883, -0.03865588909604983, -0.02237813174629856, -0.011581970882016, 0.008401303497694863, 0.00598682750741207, -0.02647921936520565, -0.08745349044258101, -0.6199482703379527, 0.7776587660292456, -2.204501859699998e-05, 3.065799954216684e-06, -0.0001088757748474737, -9.070630703475932e-05, -1.507680849966721e-05, -0.000203298163659711;
2.141350692234778e-05, 3.763794188497906e-05, 0.0002682046623337108, 0.0002761646438217766, 2.250001958053043e-05, -4.493680340744517e-05, 1.71038513853044e-05, 4.793887034272248e-05, -0.002472775598056956, -0.002583273192861402, -2.360815196252781e-05, 8.57575614248591e-07, -2.277442903271404e-06, -9.431493206768549e-06, 2.836934896747011e-06, 1.836715455464421e-05, 2.384241283455247e-05, 4.963711569589484e-06, 1.390892651258379e-05, 2.354454084909798e-05, 2.358174073858803e-05, 3.953694936818999e-05, 3.859322887829735e-05, 4.383431246805508e-06, 9.501429817743515e-06, 2.641867563533516e-05, 5.790410392283418e-05, 6.243564171284964e-05, 9.347142816394926e-06, 2.341035633032736e-05, 3.140572721234472e-05, 2.567884918875704e-06, -2.488581283389154e-06, -1.083945623896245e-05, -0.02381539022135584, 0.1464545802416884, 0.09922198413600333, 0.009864006965697942, -0.07588888859083308, -0.1732512868035658, 0.2074803672415529, 0.5543971362454099, -0.6344797023718978, -0.4234201679790431, -0.0001368109107852992, 2.172633922404158e-07, -0.0001132510107743674, -7.90184051908068e-05, 1.89704719379068e-05, -0.0001862727476251848
I thought the reason there is so much variance in the first feature is that the first eigenvalue is very large compared to the other two. When I normalize my data before PCA, I get very similar eigenvalues:
0.6660936495675316;
0.6449413383086006;
0.383110906838073
But the data still looks similar after projecting it into PCA space:
-0.816894;7.1333e-67;2.00113e-23
-0.822324;7.1333e-67;2.00113e-23
-0.831973;7.1333e-67;2.00113e-23
-0.822553;7.1333e-67;2.00113e-23

The problem is that your data for features 2, 3, and 4 is very close to, or exactly the same as, the first feature, which is why your results are not great. The magnitude of the differences may also not be large enough to capture the variance of the data.
PCA works by computing the covariances between the features, so have a look at the covariance matrix that the PCA is based on. I suspect that all of its values are very close to each other, and that most of the variance is captured by the first eigenvector of that matrix.
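If you are using scikit-learn, a minimal sketch of this check could look like the following (the file name and the use of StandardScaler are assumptions, not part of your original setup):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical: X holds the ~2000 x ~50 measurement matrix from the question.
X = np.loadtxt("measurements.csv", delimiter=";")

X_scaled = StandardScaler().fit_transform(X)   # standardize each feature
pca = PCA(n_components=20).fit(X_scaled)

# Fraction of the total variance captured by each principal component.
print(pca.explained_variance_ratio_)

# The covariance matrix PCA is derived from; compare the magnitudes of its entries.
print(np.cov(X_scaled, rowvar=False))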


Divide the testing set into subgroups, then make predictions on each subgroup separately

I have a dataset similar to the following table:
The prediction target is going to be the 'score' column. I'm wondering how I can divide the testing set into different subgroups, such as scores between 1 and 3, and then check the accuracy on each subgroup.
What I have now is as follows:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = tree.DecisionTreeRegressor()
model.fit(X_train, y_train)
for i in (0,1,2,3,4):
    y_new=y_test[(y_test>=i) & (y_test<=i+1)]
    y_new_pred=model.predict(X_test)
    print metrics.r2_score(y_new, y_new_pred)
However, my code did not work and this is the traceback that I get:
Found input variables with inconsistent numbers of samples: [14279, 55955]
I have tried the solution provided below, but it looks like the r^2 for the full score range (0-5) is 0.67, while for the sub-ranges (0-1, 1-2, 2-3, 3-4, 4-5) the r^2 values are significantly lower than that of the full range. Shouldn't some of the sub-range r^2 values be higher than 0.67 and some of them lower?
Could anyone kindly let me know where I went wrong? Thanks a lot for all your help.
When you compute the metrics, you have to filter the predicted values as well (based on your subset condition).
Basically you are trying to compute
metrics.r2_score([1,3],[1,2,3,4,5])
which creates an error,
ValueError: Found input variables with inconsistent numbers of
samples: [2, 5]
Hence, my suggested solution would be
model.fit(X_train, y_train)
#compute the prediction only once.
y_pred = model.predict(X_test)
for i in (0,1,2,3,4):
    # compute the condition for the subset here
    subset = (y_test>=i) & (y_test<=i+1)
    print metrics.r2_score(y_test[subset], y_pred[subset])

Parseval's Theorem does not hold for FFT of a sinusoid + noise?

Thanks in advance for any help on this subject. I've recently been trying to work out Parseval's theorem for discrete Fourier transforms when noise is included. I based my code on this code.
What I expected to see is that (as when no noise is included) the total power in the frequency domain is half the total power in the time domain, since I have cut off the negative frequencies.
However, as more noise is added to the time-domain signal, the total power of the Fourier transform of the signal+noise becomes much less than half of the total power of the signal+noise.
My code is as follows:
import numpy as np
import numpy.fft as nf
import matplotlib.pyplot as plt

def findingdifference(randomvalues):
    n = int(1e7) #number of points
    tmax = 40e-3 #measurement time
    f1 = 30e6 #beat frequency
    t = np.linspace(-tmax,tmax,num=n) #define time axis
    dt = t[1]-t[0] #time spacing
    gt = np.sin(2*np.pi*f1*t)+randomvalues #make a sin + noise
    fftfreq = nf.fftfreq(n,dt) #defining frequency (x) axis
    hkk = nf.fft(gt) # fourier transform of sinusoid + noise
    hkn = nf.fft(randomvalues) #fourier transform of just noise
    fftfreq = fftfreq[fftfreq>0] #only taking positive frequencies
    hkk = hkk[fftfreq>0]
    hkn = hkn[fftfreq>0]
    timedomain_p = sum(abs(gt)**2.0)*dt #parseval's theorem for time
    freqdomain_p = sum(abs(hkk)**2.0)*dt/n # parseval's theorem for frequency
    difference = (timedomain_p-freqdomain_p)/timedomain_p*100 #percentage diff
    tdomain_pn = sum(abs(randomvalues)**2.0)*dt #parseval's for time
    fdomain_pn = sum(abs(hkn)**2.0)*dt/n # parseval's for frequency
    difference_n = (tdomain_pn-fdomain_pn)/tdomain_pn*100 #percent diff
    return difference,difference_n

def definingvalues(max_amp,length):
    noise_amplitude = np.linspace(0,max_amp,length) #defining noise amplitude
    difference = np.zeros((2,len(noise_amplitude)))
    randomvals = np.random.random(int(1e7)) #defining noise
    for i in range(len(noise_amplitude)):
        difference[:,i] = (findingdifference(noise_amplitude[i]*randomvals))
    return noise_amplitude,difference

def figure(max_amp,length):
    noise_amplitude,difference = definingvalues(max_amp,length)
    plt.figure()
    plt.plot(noise_amplitude,difference[0,:],color='red')
    plt.plot(noise_amplitude,difference[1,:],color='blue')
    plt.xlabel('Noise_Variable')
    plt.ylabel(r'Difference in $\%$')
    plt.show()
    return

figure(max_amp=3,length=21)
My final graph looks like this figure. Am I doing something wrong when working this out? Is there a physical reason why this trend occurs with added noise? Is it to do with taking the Fourier transform of a signal that is not perfectly sinusoidal? The reason I am doing this is to understand a very noisy sinusoidal signal that I have real data for.
Parseval's theorem holds in general if you use the whole spectrum (positive and negative frequencies) to compute the power.
The reason for the discrepancy is the DC (f=0) component, which is treated somewhat specially.
First, where does the DC component come from? You use np.random.random to generate random values between 0 and 1, so on average you raise the signal by 0.5*noise_amplitude, which carries a lot of power. This power is correctly computed in the time domain.
However, in the frequency domain, there is only a single FFT bin that corresponds to f=0. The power of all other frequencies is distributed over two bins, only the DC power is contained in a single bin.
By scaling the noise you add DC power. By removing the negative frequencies you remove half the signal power, but most of the noise power is located in the DC component which is used fully.
You have several options:
Use all frequencies to compute the power.
Use noise without a DC component: randomvals = np.random.random(int(1e7)) - 0.5
"Fix" the power calculation by removing half of the DC power: hkk[fftfreq==0] /= np.sqrt(2)
I'd go with option 1. The second might be OK and I don't really recommend 3.
Finally, there is a minor problem with the code:
fftfreq = fftfreq[fftfreq>0] #only taking positive frequencies
hkk = hkk[fftfreq>0]
hkn = hkn[fftfreq>0]
This does not really make sense: after the first line, fftfreq has already been shortened to the positive frequencies, so the masks in the next two lines no longer line up with hkk and hkn. Better change it to
hkk = hkk[fftfreq>=0]
hkn = hkn[fftfreq>=0]
or completely remove it for option 1.
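As an illustration of option 1, here is a minimal sketch (with a much shorter signal than in the question) that checks Parseval's theorem over the full spectrum; the two values agree regardless of any DC offset in the noise:
import numpy as np
import numpy.fft as nf

n = 2**16
t = np.linspace(0, 1, n, endpoint=False)
dt = t[1] - t[0]
gt = np.sin(2*np.pi*50*t) + 0.5*np.random.random(n)  # sinusoid + noise with a DC offset

hk = nf.fft(gt)

time_power = np.sum(np.abs(gt)**2) * dt       # power in the time domain
freq_power = np.sum(np.abs(hk)**2) * dt / n   # power summed over ALL frequency bins

print(time_power, freq_power)  # these agree to numerical precision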

subplots only plotting 1 plot using pandas

I am trying to get two plots on one figure using matplotlib's subplots() command. I want the two plots to share an x-axis and have one legend for the whole plot. The code I have right now is:
observline = mlines.Line2D([], [], color=(1,0.502,0),\
markersize=15, label='Observed',linewidth=2)
wrfline=mlines.Line2D([], [], color='black',\
markersize=15, label='WRF',linewidth=2)
fig,axes=plt.subplots(2,1,sharex='col',figsize=(18,10))
df08.plot(ax=axes[0],linewidth=2, color=(1,0.502,0))\
.legend(handles=[observline,wrfline],loc='lower center', bbox_to_anchor=(0.9315, 0.9598),prop={'size':16})
axes[0].set_title('WRF Model Comparison Near %.2f,%.2f' %(lat,lon),fontsize=24)
axes[0].set_ylim(0,360)
axes[0].set_yticks(np.arange(0,361,60))
df18.plot(ax=axes[1],linewidth=2, color='black').legend_.remove()
plt.subplots_adjust(hspace=0)
axes[1].set_ylim(0,360)
axes[1].set_yticks(np.arange(0,361,60))
plt.ylabel('Wind Direction [Degrees]',fontsize=18,color='black')
axes[1].yaxis.set_label_coords(-0.05, 1)
plt.xlabel('Time',fontsize=18,color='black')
#plt.savefig(df8graphfile, dpi = 72)
plt.show()
and it produces four figures, each with two subplots. The top is always empty; the bottom is filled for three of them with my second dataframe. The index of each dataframe is a DatetimeIndex in the format YYYY-mm-DD HH:MM:SS. The data consists of values from 0-360, spread nearly randomly across the whole time series, which covers two months.
Here is an example of each figure produced:

Fitting a Gaussian, getting a straight line. Python 2.7

As my title suggests, I'm trying to fit a Gaussian to some data and I'm just getting a straight line. I've been looking at these other discussions, Gaussian fit for Python and Fitting a gaussian to a curve in Python, which seem to suggest basically the same thing. I can make the code in those discussions work fine for the data they provide, but it won't work for my data.
My code looks like this:
import pylab as plb
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy import asarray as ar,exp
y = y - y[0] # to make it go to zero on both sides
x = range(len(y))
max_y = max(y)
n = len(y)
mean = sum(x*y)/n
sigma = np.sqrt(sum(y*(x-mean)**2)/n)
# Someone on a previous post seemed to think this needed to have the sqrt.
# Tried it without as well, made no difference.
def gaus(x,a,x0,sigma):
    return a*exp(-(x-x0)**2/(2*sigma**2))
popt,pcov = curve_fit(gaus,x,y,p0=[max_y,mean,sigma])
# It was suggested in one of the other posts I looked at to make the
# first element of p0 be the maximum value of y.
# I also tried it as 1, but that did not work either
plt.plot(x,y,'b:',label='data')
plt.plot(x,gaus(x,*popt),'r:',label='fit')
plt.legend()
plt.title('Fig. 3 - Fit for Time Constant')
plt.xlabel('Time (s)')
plt.ylabel('Voltage (V)')
plt.show()
The data I am trying to fit is as follows:
y = array([ 6.95301373e+12, 9.62971320e+12, 1.32501876e+13,
1.81150568e+13, 2.46111132e+13, 3.32321345e+13,
4.45978682e+13, 5.94819771e+13, 7.88394616e+13,
1.03837779e+14, 1.35888594e+14, 1.76677210e+14,
2.28196006e+14, 2.92781632e+14, 3.73133045e+14,
4.72340762e+14, 5.93892782e+14, 7.41632194e+14,
9.19750269e+14, 1.13278296e+15, 1.38551838e+15,
1.68291212e+15, 2.02996957e+15, 2.43161742e+15,
2.89259207e+15, 3.41725793e+15, 4.00937676e+15,
4.67187762e+15, 5.40667931e+15, 6.21440313e+15,
7.09421973e+15, 8.04366842e+15, 9.05855930e+15,
1.01328502e+16, 1.12585509e+16, 1.24257598e+16,
1.36226443e+16, 1.48356404e+16, 1.60496345e+16,
1.72482199e+16, 1.84140400e+16, 1.95291969e+16,
2.05757166e+16, 2.15360187e+16, 2.23933053e+16,
2.31320228e+16, 2.37385276e+16, 2.42009864e+16,
2.45114362e+16, 2.46427484e+16, 2.45114362e+16,
2.42009864e+16, 2.37385276e+16, 2.31320228e+16,
2.23933053e+16, 2.15360187e+16, 2.05757166e+16,
1.95291969e+16, 1.84140400e+16, 1.72482199e+16,
1.60496345e+16, 1.48356404e+16, 1.36226443e+16,
1.24257598e+16, 1.12585509e+16, 1.01328502e+16,
9.05855930e+15, 8.04366842e+15, 7.09421973e+15,
6.21440313e+15, 5.40667931e+15, 4.67187762e+15,
4.00937676e+15, 3.41725793e+15, 2.89259207e+15,
2.43161742e+15, 2.02996957e+15, 1.68291212e+15,
1.38551838e+15, 1.13278296e+15, 9.19750269e+14,
7.41632194e+14, 5.93892782e+14, 4.72340762e+14,
3.73133045e+14, 2.92781632e+14, 2.28196006e+14,
1.76677210e+14, 1.35888594e+14, 1.03837779e+14,
7.88394616e+13, 5.94819771e+13, 4.45978682e+13,
3.32321345e+13, 2.46111132e+13, 1.81150568e+13,
1.32501876e+13, 9.62971320e+12, 6.95301373e+12,
4.98705540e+12])
I would show you what it looks like, but apparently I don't have enough reputation points...
Anyone got any idea why it's not fitting properly?
Thanks for your help :)
The importance of the initial guess, the p0 argument of curve_fit, cannot be stressed enough.
Notice that the docstring mentions that
[p0] If None, then the initial values will all be 1
So if you do not supply it, it will use an initial guess of 1 for all parameters you're trying to optimize for.
The choice of p0 affects the speed at which the underlying algorithm changes the guess vector p0 (ref. the documentation of least_squares).
When you look at the data you have, you'll notice that the maximum and the mean, mu_0, of the Gaussian-like dataset y are about 2.4e16 and 49 respectively. With the peak value so large, the algorithm would need to make drastic changes to its initial guess of 1 to reach that value.
When you supply a good initial guess to the curve fitting algorithm, convergence is more likely to occur.
Using your data, you can supply a good initial guess for the peak_value, the mean and sigma, by writing them like this:
y = np.array([...]) # starting from the original dataset
x = np.arange(len(y))
peak_value = y.max()
mean = x[y.argmax()] # observation of the data shows that the peak is close to the center of the interval of the x-data
sigma = mean - np.where(y > peak_value * np.exp(-.5))[0][0] # when x is sigma in the gaussian model, the function evaluates to a*exp(-.5)
popt,pcov = curve_fit(gaus, x, y, p0=[peak_value, mean, sigma])
print(popt) # prints: [ 2.44402560e+16 4.90000000e+01 1.20588976e+01]
Note that in your code you take mean = sum(x*y)/n, which is strange: it effectively multiplies the Gaussian by a monotonically increasing line of constant slope and then normalizes by the number of points rather than by sum(y); the weighted mean of x with weights y would be sum(x*y)/sum(y). The result badly offsets the estimated mean (in this case to the right). A similar remark can be made about your calculation of sigma.
Final remark: a histogram of y will not resemble a Gaussian, because y itself already is a Gaussian curve. The histogram merely bins (counts) values into categories (answering the question "how many data points in y fall within [a, b]?").
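For reference, if you prefer the moment-based estimates from the question over the peak-based ones above, the weights need to be normalized by sum(y) rather than by the number of points; a small sketch (the y below is only a stand-in for your data):
import numpy as np

y = np.exp(-0.5 * ((np.arange(100) - 49.0) / 12.0)**2)  # stand-in for the question's data
x = np.arange(len(y))

mean_est = np.sum(x * y) / np.sum(y)                            # weighted mean of x with weights y
sigma_est = np.sqrt(np.sum(y * (x - mean_est)**2) / np.sum(y))  # weighted standard deviation
print(mean_est, sigma_est)  # close to 49 and 12 for this stand-in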

How does scikit's cross validation work?

I have the following snippet:
print '\nfitting'
rfr = RandomForestRegressor(
    n_estimators=10,
    max_features='auto',
    criterion='mse',
    max_depth=None,
)
rfr.fit(X_train, y_train)

# scores
scores = cross_val_score(
    estimator=rfr,
    X=X_test,
    y=y_test,
    verbose=1,
    cv=10,
    n_jobs=4,
)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
1) Does running the cross_val_score do more training on the regressor?
2) Do I need to pass in a trained regressor or just a new one, e.g. estimator=RandomForestRegressor(). How then do I test the accuracy of a regressor, i.e. must I use another function in scikit?
3) My accuracy is about 2%. Is that the MSE score, where lower is better, or is it the actual accuracy? If it is the actual accuracy, can you explain it, because it doesn't make sense how a regressor could accurately predict over a continuous range.
1) It re-trains the estimator, k times in fact.
2) Untrained (or trained, but then the model is deleted and you're just wasting time).
3) It's the R² score, so that's not actually 2% but .02; R² is capped at 1 but can be negative. Accuracy is not well-defined for regression. (You can define it as for classification, but that makes no sense.)
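To illustrate point 2, here is a minimal sketch of the usual pattern, assuming a recent scikit-learn where cross_val_score lives in sklearn.model_selection and that X and y hold your full dataset:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Pass an *untrained* estimator; cross_val_score clones and fits it once per fold.
rfr = RandomForestRegressor(n_estimators=10)
scores = cross_val_score(rfr, X, y, cv=10, scoring='r2')  # R^2 per fold (the default for regressors)
print("R^2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))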