How to normalize the data - data-mining

Normalize the data set to make the norm of each data point equal to 1.
x1 (1.5,1.7) [x1 (i,j)]
x2 (2,1.9)
x3 (1.6,1.8)
x4 (1.2,1.5)
x5 (1.5,1.0)
Given a new data point, x = (1.4, 1.6), as a query, the solution after normalization is:
x(0.6585,0.7526)
x1(0.6616,0.7498 )
x2(0.7250,0.6887)
x3(0.6644,0.7474)
x4(0.6247,0.7809)
x5(0.8321,0.5547)
But I am confused about how this solution is obtained; I tried different formulas and none of them worked.

For x = (1.4, 1.6): norm(x) = sqrt(1.4^2 + 1.6^2) ≈ 2.13.
The normalized x is (1.4/2.13, 1.6/2.13) ≈ (0.6585, 0.7526).
Apply the same formula to x1 through x5 and you get the quoted solution.

You have been trying column-wise normalization, but the exercise asks for normalization to unit length: each data point is divided by its own norm.
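A quick NumPy sketch of this row-wise, unit-length normalization (the array names are just illustrative):

import numpy as np

# Each row is one data point; divide every row by its own L2 norm.
X = np.array([[1.5, 1.7], [2.0, 1.9], [1.6, 1.8], [1.2, 1.5], [1.5, 1.0]])
query = np.array([1.4, 1.6])
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)
print(query_unit)  # roughly [0.6585 0.7526], matching the quoted solution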

Related

MHD Shock Waves

I need to solve these six equations (https://en.wikipedia.org/wiki/Shocks_and_discontinuities_(magnetohydrodynamics)) for the MHD shock-wave jump conditions using scipy's fsolve. There are 6 unknown variables whose values need to be stored in the array z. While I am getting correct answers for some cases, in most cases the values are wrong; they vary according to the initial guess. What would be a correct initial guess so that it works for all cases?
Note: z[0] and z[1] denote pressure and density and can't be negative, so a correct answer will give positive values.
from scipy.optimize import fsolve, root
from scipy.constants import mu_0 as w
print("Assume the shock propogates in the x direction")
p1=float(input("Upstream pressure-"))
rho1=float(input("Upstream density-"))
vx1=float(input("Upstream velocity along x axis-"))
vy1=float(input("Upstream velocity along y axis-"))
bx1=float(input("Upstream magnetic field along x axis-"))
by1=float(input("Upstream magnetic field along y axis-"))
def eqn(x):  # jump conditions
    f1 = (x[1]*x[2]) - (rho1*vx1)
    f2 = x[0] + (x[1]*x[2]*x[2]) + (x[5]*x[5]*0.5/w) - p1 - (rho1*vx1*vx1) - by1*by1*0.5/w
    f3 = (x[1]*x[2]*x[3] - x[4]*x[5]/w) - rho1*vx1*vy1 + bx1*by1/w
    f4 = x[4] - bx1
    f5 = x[2]*x[5] - x[4]*x[3] - vx1*by1 + vy1*bx1
    f6 = x[1]*x[2]*(0.5*(x[2]**2 + x[3]**2) + 2.5*x[0]/x[1]) + x[2]*(x[5]**2)/w - x[4]*x[5]*x[3]/w - rho1*vx1*(0.5*(vx1**2 + vy1**2) + 2.5*p1/rho1) - (vx1*by1**2)/w + bx1*by1*vy1/w
    return (f1, f2, f3, f4, f5, f6)
y=[2*p1,4*rho1,0.5,0.5,bx1*2,by1*2] #initial guess value
z=fsolve(eqn,y)
print('\n', 'Upstream Pressure- ', p1, '\t'*3, 'Downstream Pressure- ', z[0])
print('Upstream Density- ', rho1, '\t'*4, 'Downstream Density- ', z[1])
print('Upstream velocity along x axis- ', vx1, '\t'*2, 'Downstream velocity along x axis- ', z[2])
print('Upstream velocity along y axis- ', vy1, '\t'*2, 'Downstream velocity along y axis- ', z[3])
print('Upstream magnetic field along x axis- ', bx1, '\t', 'Downstream magnetic field along x axis- ', z[4])
print('Upstream magnetic field along y axis- ', by1, '\t', 'Downstream magnetic field along y axis- ', z[5])
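There is no single initial guess guaranteed to work for every upstream state. One pragmatic workaround, sketched below, is to try a few physically motivated guesses and keep the first root that fsolve reports as converged and that has positive pressure and density; the helper name solve_jump and the particular scaling factors are illustrative, not part of the original code.

from scipy.optimize import fsolve

def solve_jump(eqn, p1, rho1, vx1, vy1, bx1, by1):
    # Candidate guesses: scale the upstream state by a few compression ratios.
    for r in (1.5, 2.0, 3.0, 4.0):
        y0 = [r*p1, r*rho1, vx1/r, vy1, bx1, r*by1]
        z, info, ier, msg = fsolve(eqn, y0, full_output=True)
        if ier == 1 and z[0] > 0 and z[1] > 0:  # converged and physically admissible
            return z
    return None  # no admissible root found for these guesses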

pywavelet signal reconstruction

I am trying to understand the concept of wavelets using the pywavelet library. My first step was to see how I could reconstruct a given input signal using the wavelet coefficients. Please see my code below:
db1 = pywt.Wavelet('db1')
cA6, cD6,cD5, cD4, cD3, cD2, cD1=pywt.wavedec(data, db1, level=6)
cA6cD_approx = (pywt.upcoef('a', cA6, 'db1', take=n, level=6)
                + pywt.upcoef('d', cD1, 'db1', take=n, level=6)
                + pywt.upcoef('d', cD2, 'db1', take=n, level=6)
                + pywt.upcoef('d', cD3, 'db1', take=n, level=6)
                + pywt.upcoef('d', cD4, 'db1', take=n, level=6)
                + pywt.upcoef('d', cD5, 'db1', take=n, level=6)
                + pywt.upcoef('d', cD6, 'db1', take=n, level=6))
plt.figure(figsize=(28,10))
p1, =plt.plot(t, cA6cD_approx,'r')
p2, =plt.plot(t, data, 'b')
plt.xlabel('Day')
plt.ylabel('Number of units sold')
plt.legend([p2,p1], ["original signal", "cA6+cD* reconstructed"])
plt.show()
This yielded the following plot:
Now, when I used the waverec() method, the signal reconstruction was quite accurate. Please see plot below:
Can someone please explain the difference between the two reconstruction methods?
They are both forms of the inverse discrete wavelet transform. "upcoef" is a direct reconstruction from a single set of coefficients, while "waverec" is a multilevel 1D inverse discrete wavelet transform; it does much the same thing, but in a way that lets you line up your coefficients and is more convenient to work with.
I changed your code a little, especially the "level" setting passed to upcoef for each detail band. From the plot, you will see that the two ways of reconstructing produce the same result.
import numpy as np
import pywt
import matplotlib.pyplot as plt
data = np.loadtxt('Mysample_test.txt')
n = len(data)
wl = pywt.Wavelet("db1")
coeff_all = pywt.wavedec(data, wl, level=6)
cA6, cD6,cD5, cD4, cD3, cD2, cD1= coeff_all
omp0 = pywt.upcoef('a',cA6,wl,level=6)[:n]
omp1 = pywt.upcoef('d',cD1,wl,level=1)[:n]
omp2 = pywt.upcoef('d',cD2,wl,level=2)[:n]
omp3 = pywt.upcoef('d',cD3,wl,level=3)[:n]
omp4 = pywt.upcoef('d',cD4,wl,level=4)[:n]
omp5 = pywt.upcoef('d',cD5,wl,level=5)[:n]
omp6 = pywt.upcoef('d',cD6,wl,level=6)[:n]
#cA6cD_approx = omp0 + omp1 + omp2 + omp3 + omp4+ omp5 + omp6
#plt.figure(figsize=(18,9))
recon = pywt.waverec(coeff_all, wavelet= wl)
p1, =plt.plot(omp0 + omp6 + omp5 + omp4 + omp3 + omp2 + omp1,'r')
p2, =plt.plot(data, 'b')
p3, =plt.plot(recon, 'y')
plt.xlabel('Day')
plt.ylabel('Number of units sold')
plt.legend([p3,p2,p1], ["waverec reconstructed","original signal", "cA6+cD* reconstructed"])
plt.show()
The function wavedec performs a tree decomposition, which means a filtering followed by a downsampling (by a factor of 2 for a dyadic scheme).
Both functions waverec and upcoef can lead to reconstruction.
The first one, waverec, performs a direct tree reconstruction symmetrical to what is done by wavedec, which means an upsampling followed by a filtering. At each reconstruction level (6 in your case) a summation is also performed to yield a signal with more details to be used for the next reconstruction level.
The second function, upcoef, lets you reconstruct a given subscale independently, without considering the rest of the details contained in the other subscales. The missing coefficients are effectively treated as zeros when rebuilding the signal, so upcoef can be seen as an interpolation operator.
In your case, you used upcoef to interpolate all the wavelet subscales from their decimated x-grid to the original x-grid. You then performed the summation of all the interpolated signals (only containing a defined and limited quantity of details). Because Daubechies' wavelets are orthogonal, they lead to a perfect reconstruction and this way you can get your original signal back after reconstruction.
In short:
waverec => direct reconstruction => original signal
n times upcoef => interpolation followed by a global summation => original signal
Subscale interpolation is only useful when you want to visualise all the details on the same non-decimated x-grid. Such an interpolation brings nothing more, since the quantity of information contained in any subscale and in its interpolated version is the same.
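A quick numerical check (using the arrays from the snippet above) confirms that the summed upcoef reconstructions agree with the waverec output:

import numpy as np

summed = omp0 + omp1 + omp2 + omp3 + omp4 + omp5 + omp6
print(np.allclose(summed, recon[:n]))  # True: both reconstructions coincide
print(np.allclose(summed, data))       # True for an orthogonal wavelet such as db1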

Fitting a Gaussian, getting a straight line. Python 2.7

As my title suggests, I'm trying to fit a Gaussian to some data and I'm just getting a straight line. I've been looking at these other discussions, Gaussian fit for Python and Fitting a gaussian to a curve in Python, which seem to suggest basically the same thing. I can make the code in those discussions work fine for the data they provide, but it won't work for my data.
My code looks like this:
import pylab as plb
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy import asarray as ar, exp
y = y - y[0] # to make it go to zero on both sides
x = range(len(y))
max_y = max(y)
n = len(y)
mean = sum(x*y)/n
sigma = np.sqrt(sum(y*(x-mean)**2)/n)
# Someone on a previous post seemed to think this needed to have the sqrt.
# Tried it without as well, made no difference.
def gaus(x, a, x0, sigma):
    return a*exp(-(x-x0)**2/(2*sigma**2))
popt,pcov = curve_fit(gaus,x,y,p0=[max_y,mean,sigma])
# It was suggested in one of the other posts I looked at to make the
# first element of p0 be the maximum value of y.
# I also tried it as 1, but that did not work either
plt.plot(x,y,'b:',label='data')
plt.plot(x,gaus(x,*popt),'r:',label='fit')
plt.legend()
plt.title('Fig. 3 - Fit for Time Constant')
plt.xlabel('Time (s)')
plt.ylabel('Voltage (V)')
plt.show()
The data I am trying to fit is as follows:
y = array([ 6.95301373e+12, 9.62971320e+12, 1.32501876e+13,
1.81150568e+13, 2.46111132e+13, 3.32321345e+13,
4.45978682e+13, 5.94819771e+13, 7.88394616e+13,
1.03837779e+14, 1.35888594e+14, 1.76677210e+14,
2.28196006e+14, 2.92781632e+14, 3.73133045e+14,
4.72340762e+14, 5.93892782e+14, 7.41632194e+14,
9.19750269e+14, 1.13278296e+15, 1.38551838e+15,
1.68291212e+15, 2.02996957e+15, 2.43161742e+15,
2.89259207e+15, 3.41725793e+15, 4.00937676e+15,
4.67187762e+15, 5.40667931e+15, 6.21440313e+15,
7.09421973e+15, 8.04366842e+15, 9.05855930e+15,
1.01328502e+16, 1.12585509e+16, 1.24257598e+16,
1.36226443e+16, 1.48356404e+16, 1.60496345e+16,
1.72482199e+16, 1.84140400e+16, 1.95291969e+16,
2.05757166e+16, 2.15360187e+16, 2.23933053e+16,
2.31320228e+16, 2.37385276e+16, 2.42009864e+16,
2.45114362e+16, 2.46427484e+16, 2.45114362e+16,
2.42009864e+16, 2.37385276e+16, 2.31320228e+16,
2.23933053e+16, 2.15360187e+16, 2.05757166e+16,
1.95291969e+16, 1.84140400e+16, 1.72482199e+16,
1.60496345e+16, 1.48356404e+16, 1.36226443e+16,
1.24257598e+16, 1.12585509e+16, 1.01328502e+16,
9.05855930e+15, 8.04366842e+15, 7.09421973e+15,
6.21440313e+15, 5.40667931e+15, 4.67187762e+15,
4.00937676e+15, 3.41725793e+15, 2.89259207e+15,
2.43161742e+15, 2.02996957e+15, 1.68291212e+15,
1.38551838e+15, 1.13278296e+15, 9.19750269e+14,
7.41632194e+14, 5.93892782e+14, 4.72340762e+14,
3.73133045e+14, 2.92781632e+14, 2.28196006e+14,
1.76677210e+14, 1.35888594e+14, 1.03837779e+14,
7.88394616e+13, 5.94819771e+13, 4.45978682e+13,
3.32321345e+13, 2.46111132e+13, 1.81150568e+13,
1.32501876e+13, 9.62971320e+12, 6.95301373e+12,
4.98705540e+12])
I would show you what it looks like, but apparently I don't have enough reputation points...
Anyone got any idea why it's not fitting properly?
Thanks for your help :)
The importance of the initial guess, p0 in curve_fit's default argument list, cannot be stressed enough.
Notice that the docstring mentions that
[p0] If None, then the initial values will all be 1
So if you do not supply it, it will use an initial guess of 1 for all parameters you're trying to optimize for.
The choice of p0 affects the speed at which the underlying algorithm changes the guess vector p0 (ref. the documentation of least_squares).
When you look at the data you have, you'll notice that the maximum and the mean, mu_0, of the Gaussian-like dataset y are about 2.4e16 and 49 respectively. With the peak value so large, the algorithm would need to make drastic changes to its default initial guess of 1 to reach that value.
When you supply a good initial guess to the curve fitting algorithm, convergence is more likely to occur.
Using your data, you can supply a good initial guess for the peak_value, the mean and sigma, by writing them like this:
y = np.array([...]) # starting from the original dataset
x = np.arange(len(y))
peak_value = y.max()
mean = x[y.argmax()] # observation of the data shows that the peak is close to the center of the interval of the x-data
sigma = mean - np.where(y > peak_value * np.exp(-.5))[0][0] # when x is sigma in the gaussian model, the function evaluates to a*exp(-.5)
popt,pcov = curve_fit(gaus, x, y, p0=[peak_value, mean, sigma])
print(popt) # prints: [ 2.44402560e+16 4.90000000e+01 1.20588976e+01]
Note that in your code you take sum(x*y)/n for the mean, which is odd: it modulates the Gaussian by a polynomial of degree 1 (it multiplies the Gaussian by a monotonically increasing line of constant slope) before averaging, which offsets the estimated mean of y (in this case to the right). A similar remark can be made about your calculation of sigma.
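If you do want moment-based starting values rather than reading them off the peak, the usual estimates weight by y instead of dividing by n; a short sketch under that assumption (with x the integer grid defined above):

import numpy as np

mean = np.sum(x*y) / np.sum(y)                          # weighted mean of x
sigma = np.sqrt(np.sum(y*(x - mean)**2) / np.sum(y))    # weighted standard deviation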
Final remark: the histogram of y will not resemble a Gaussian, as y is already a Gaussian. The histogram will merely bin (count) values into different categories (answering the question "how many datapoints in y reach a value between [a, b]?").

Merge Time Series in Apache Pig

I think my problem is trivial, but I'm new to Pig and I can't see an obvious answer in the documentation. I have two time series I wish to merge. Let's say one of them is just a stream of events X:
100 A
200 B
300 C
400 D
500 E
600 F
Then another one indicates when some state changes happen, call it Y.
50 on
250 off
350 on
450 off
I would like to tag the first time series X with the current on/off status from Y. So specifically I want:
100 A on
200 B on
300 C off
400 D on
500 E off
600 F off
If I was writing this in another language I might do something like merge sort X and Y and take a single pass through it, remembering the last on/off status and tagging the X entries.
What is the best way to do this in Pig? I have received some existing code which uses a JOIN of X and Y and then filters it, but I think the data inflation caused by the join is unnecessary.
I don't think there is a very easy solution. Here is some pseudo-code:
X1 = Rank X;
Y1 = Rank Y;
XY = JOIN X1 BY $0 LEFT OUTER, Y1 BY $0;
SPLIT XY INTO status_known IF status is not null, status_unknown OTHERWISE;
--Y2: Find out last status in Y1 (with Group all, max)
--Y3: Cross status_unknown with Y2
UNION status_known and Y3
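For comparison, the single-pass merge described in the question (sort both series by time, sweep once, and carry the last known status forward) is straightforward outside Pig; a minimal Python sketch using the sample data from the question:

X = [(100, 'A'), (200, 'B'), (300, 'C'), (400, 'D'), (500, 'E'), (600, 'F')]
Y = [(50, 'on'), (250, 'off'), (350, 'on'), (450, 'off')]

tagged, status, j = [], None, 0
for t, label in sorted(X):
    # Advance through the status changes that happened at or before time t.
    while j < len(Y) and Y[j][0] <= t:
        status = Y[j][1]
        j += 1
    tagged.append((t, label, status))
print(tagged)  # [(100, 'A', 'on'), ..., (600, 'F', 'off')]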

How to Add Numbers in a Matrix to Yield Minimum Result?

This is more of an algorithmic/math problem but I'm hoping to implement a solution in C++.
Suppose I have a matrix like so where the dots represent integers:
W X Y Z
A . . . .
B . . . .
C . . . .
D . . . .
How would I yield the minimum result if I had to pick one number from each column such that there is at most one number from each row?
For instance, I could choose AW BX CY DZ or AZ BX CY DW but NOT AW BW CZ DZ
The brute force approach would seem to take n! calculations. Is there a quicker way? Eventually I would like to add numbers in matrices of size ~60.
Also, all numbers range from 0 to 256.
And if you'd rather not code it yourself, you can always build on someone else's hard work and kind publication. This one, in Haskell, solves a 60x60 random matrix in less than two tenths of a second on my old laptop. What a great algorithm!
import Data.Algorithm.Munkres
import Data.Array.Unboxed
import Data.List (transpose)
solve n matrix =
    hungarianMethodInt (listArray ((1,1),(n,n)) $ concat $ transpose matrix)
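Although the question asks for C++, if Python is acceptable for prototyping, scipy ships a solver for the same linear sum assignment problem; a small sketch with random data in the question's 0..255 range:

import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.random.randint(0, 256, size=(60, 60))   # 60x60 matrix of integers in 0..255
rows, cols = linear_sum_assignment(cost)          # optimal pairing of rows to columns
print(cost[rows, cols].sum())                     # minimum achievable total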