GaussianNB :- ValueError: The sum of the priors should be 1 - python-2.7

What am I trying to do?
I am trying to train a dataset with 10 labels using the GaussianNB classifier, but while tuning the GaussianNB priors parameter I am getting this error:
File "/home/mg/anaconda2/lib/python2.7/site-packages/sklearn/naive_bayes.py", line 367, in _partial_fit
raise ValueError('The sum of the priors should be 1.')
ValueError: The sum of the priors should be 1.
Code for this:-
clf = GaussianNB(priors = [0.08, 0.14, 0.03, 0.16, 0.11, 0.16, 0.07, 0.14, 0.11, 0.0])
You can see the sum is clearly 1, but it is still showing me this error. Can you point out the mistake?

This looks like a pretty bad design decision within sklearn: they are doing exactly the thing you are told not to do with floating-point numbers, comparing them for exact equality (see "What Every Computer Scientist Should Know About Floating-Point Arithmetic"), which surprises me, as sklearn is usually high-quality code!
(I don't see any wrong usage on your end, despite passing a list. The docs call for an array, not array-like as in many other cases, but their code does the array conversion nonetheless.)
Their code:
if self.priors is not None:
    priors = np.asarray(self.priors)
    # Check that the provide prior match the number of classes
    if len(priors) != n_classes:
        raise ValueError('Number of priors must match number of'
                         ' classes.')
    # Check that the sum is 1
    if priors.sum() != 1.0:
        raise ValueError('The sum of the priors should be 1.')
    # Check that the prior are non-negative
    if (priors < 0).any():
        raise ValueError('Priors must be non-negative.')
    self.class_prior_ = priors
else:
    # Initialize the priors to zeros for each class
    self.class_prior_ = np.zeros(len(self.classes_),
                                 dtype=np.float64)
So:
You give a list, but their code converts it to a numpy array,
therefore np.sum() is used for the summation.
Floating-point rounding during that summation means your sum is technically != 1.0, although it is very close to it.
An exact floating-point comparison like x == 1.0 is considered bad practice;
numpy provides np.isclose(), which is the usual way of doing this check.
Demo:
import numpy as np
priors = np.array([0.08, 0.14, 0.03, 0.16, 0.11, 0.16, 0.07, 0.14, 0.11, 0.0])
my_sum = np.sum(priors)
print('my_sum: ', my_sum)
print('naive: ', my_sum == 1.0)
print('safe: ', np.isclose(my_sum, 1.0))
Output:
('my_sum: ', 1.0000000000000002)
('naive: ', False)
('safe: ', True)
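Until that is changed in sklearn, a pragmatic workaround on the calling side (my suggestion, not part of the original question or of sklearn's API) is to nudge the priors so that their floating-point sum compares equal to 1.0, for example by recomputing the last entry from the others:
import numpy as np
from sklearn.naive_bayes import GaussianNB

priors = np.array([0.08, 0.14, 0.03, 0.16, 0.11, 0.16, 0.07, 0.14, 0.11, 0.0])
priors[-1] = 1.0 - priors[:-1].sum()  # absorb the rounding error into the last entry

print(priors.sum() == 1.0)            # usually True after the adjustment, though not formally guaranteed
clf = GaussianNB(priors=priors)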
Edit:
As I think this code is not good, I posted an issue here, which you can follow to see whether they agree.
numpy.random.choice(), which also takes such a probability vector, actually uses a floating-point-safe approach (a numerically more stable summation plus an epsilon check, although not np.isclose()), as seen here.


Pvlib / Bird1984: North-facing element shows negative Irradiance

When using pvlib (but also the spectrl2 implementation provided by NREL), I obtain negative Irradiance for a north-facing panel.
Is this expected behaviour? Should the spectrum simply be cut at zero?
Added example code based on the tutorial below:
## Using PV Lib
from pvlib import spectrum, solarposition, irradiance, atmosphere
import pandas as pd
import matplotlib.pyplot as plt
# assumptions from the technical report:
lat = 49.88
lon = 8.63
tilt = 45
azimuth = 0 # North = 0
pressure = 101300 # sea level, roughly
water_vapor_content = 0.5 # cm
tau500 = 0.1
ozone = 0.31 # atm-cm
albedo = 0.2
times = pd.date_range('2021-11-30 8:00', freq='h', periods=6, tz="Europe/Berlin") # , tz='Etc/GMT+9'
solpos = solarposition.get_solarposition(times, lat, lon)
aoi = irradiance.aoi(tilt, azimuth, solpos.apparent_zenith, solpos.azimuth)
# The technical report uses the 'kasten1966' airmass model, but later
# versions of SPECTRL2 use 'kastenyoung1989'. Here we use 'kasten1966'
# for consistency with the technical report.
relative_airmass = atmosphere.get_relative_airmass(solpos.apparent_zenith,
                                                   model='kasten1966')
spectra = spectrum.spectrl2(
    apparent_zenith=solpos.apparent_zenith,
    aoi=aoi,
    surface_tilt=tilt,
    ground_albedo=albedo,
    surface_pressure=pressure,
    relative_airmass=relative_airmass,
    precipitable_water=water_vapor_content,
    ozone=ozone,
    aerosol_turbidity_500nm=tau500,
)
plt.figure()
plt.plot(spectra['wavelength'], spectra['poa_global'])
plt.xlim(200, 2700)
# plt.ylim(0, 1.8)
plt.title(r"2021-11-30, Darmstadt, $\tau=0.1$, Wv=0.5 cm")
plt.ylabel(r"Irradiance ($W m^{-2} nm^{-1}$)")
plt.xlabel(r"Wavelength ($nm$)")
time_labels = times.strftime("%H:%M %p")
labels = [
    "AM {:0.02f}, Z{:0.02f}, {}".format(*vals)
    for vals in zip(relative_airmass, solpos.apparent_zenith, time_labels)
]
plt.legend(labels)
plt.show()
No, this is not expected behavior. I suspect the issue is caused by improper handling of angle-of-incidence values greater than 90 degrees, and essentially the same problem (for a different function) discussed here: https://github.com/pvlib/pvlib-python/issues/526
It's unfortunate that the reference implementation from NREL has the problem too (perhaps when the model was originally designed, nobody could conceive of a panel facing away from the sun!), but I think the pvlib implementation should be fixed regardless. I encourage you to file a bug report here: https://github.com/pvlib/pvlib-python/issues
In the meantime, I think you can resolve the issue in your own code by adding a line like aoi[aoi > 90] = 90 prior to passing it to spectrum.spectrl2, although be careful about this if you end up using aoi for other purposes later in the script. I would be interested to hear if the resulting spectra are consistent with your expectations.
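For reference, a minimal sketch of that workaround, reusing the variable names from the script above (the copy is only there so the original aoi stays available for later use; this is my illustration, not tested against your data):
aoi_clamped = aoi.copy()
aoi_clamped[aoi_clamped > 90] = 90

spectra = spectrum.spectrl2(
    apparent_zenith=solpos.apparent_zenith,
    aoi=aoi_clamped,
    surface_tilt=tilt,
    ground_albedo=albedo,
    surface_pressure=pressure,
    relative_airmass=relative_airmass,
    precipitable_water=water_vapor_content,
    ozone=ozone,
    aerosol_turbidity_500nm=tau500,
)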
Edit for posterity: a github issue has been opened here: https://github.com/pvlib/pvlib-python/issues/1348

How many FLOPs are there in calculating a factorial using math.factorial(n) in python

I am trying to understand how many FLOPs there are if I use a certain algorithm to find the approximated exponential sum, especially if I use math.factorial(n) in Python. I understand FLOPs for binary operations, but is factorial also a binary operation here within a function? Not being a computer science major, I have some difficulties with these. My code looks like this:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
import math

x = input("please enter a number for which you want to run the exponential summation e^{x}")
N = input("Please enter an integer before which term you want to truncate your summation")

function = math.exp(x)
exp_sum = 0.0
abs_err = 0.0
rel_err = 0.0

for n in range(0, N):
    factorial = math.factorial(n)  # How many FLOPs here?
    power = x**n                   # calculates N-1 times
    nth_term = power/factorial     # calculates N-1 times
    exp_sum = exp_sum + nth_term   # calculates N-1 times

abs_err = abs(function - exp_sum)
rel_err = abs(abs_err)/abs(function)
Please help me understand this. I might also be wrong about the other FLOPs!
According to that SO answer and to the C source code, in Python 2.7 math.factorial(n) uses a naive algorithm to compute the factorial, so it takes roughly n operations: factorial(n) = 1*2*3*...*n.
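For illustration, the naive algorithm amounts to something like the following (a sketch of the idea, not the actual CPython source):
def naive_factorial(n):
    # about n - 1 multiplications for n >= 2
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result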
A small mistake regarding the rest: for n in range(0, N) will loop N times, not N-1 (from n=0 to n=N-1).
A final note: counting FLOPs may not be representative of the algorithm's real-world performance, especially in Python, which is an interpreted language and tends to hide most of its inner workings behind clever syntax that links to compiled C code (e.g. exp_sum + nth_term is actually exp_sum.__add__(nth_term)).

numpy arange for floating point start & stop

I want numpy to create a full list, given the parameters start, stop and increment, but ran into some trouble:
In[2]: import numpy as np
In[3]: np.arange(2.0, 2.4, 0.2)
Out[3]: array([ 2. , 2.2])
In[4]: np.arange(2.0, 2.6, 0.2)
Out[4]: array([ 2. , 2.2, 2.4, 2.6])
In[5]: np.arange(2.0, 2.8, 0.2)
Out[5]: array([ 2. , 2.2, 2.4, 2.6])
What I actually want is:
array([ 2. , 2.2, 2.4])
Now, I've learned that I should avoid floating-point types when it comes down to fixed values. I know it would be better to multiply start/stop/increment by 100, but the problem is that I cannot tell how many decimals the user is going to supply. Is there any way I can still do this with floats, or is there a better way to solve it?
Edit:
It works now with the obvious solution of adding 0.0000001 to the end value, but this looks horrible in my code... I'd hope to fix this more cleanly somehow.
Could you specify which values the user is supposed to enter? For that kind of generation, I think linspace could be better, as it includes the end parameter.
EDIT: if the user enters start, end, and increment, just use linspace with num = int((end-start)/increment+1) if the exact value of the increment is not critical.
EDIT2:
Adapt 1e-4 to the relative error you deem acceptable (you can even expose it as a user-defined variable).
eps = 1e-4*(stop - start)
num = int((stop - start)/(incr - eps) + 1)
np.linspace(start, stop, num)
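As a quick check with the values from the question (this concrete example is mine, not from the original answer):
import numpy as np

start, stop, incr = 2.0, 2.4, 0.2
eps = 1e-4*(stop - start)
num = int((stop - start)/(incr - eps) + 1)
print(np.linspace(start, stop, num))  # should print [2.  2.2 2.4]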
This might seem a little longer, but if you are keen on using np.arange this is how I worked it out:
import decimal
import numpy as np

# MIN, MAX and STEP are the user-supplied start, stop and increment
decimal_places = decimal.Decimal(str(STEP)).as_tuple().exponent
power_10_multiplier = 10**-decimal_places
MIN = int(MIN*power_10_multiplier)
MAX = int(MAX*power_10_multiplier)
STEP = int(STEP*power_10_multiplier)
arr = np.arange(MIN, MAX + STEP, step=STEP)/power_10_multiplier
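For instance, with MIN = 2.0, MAX = 2.4 and STEP = 0.2 this should give array([2., 2.2, 2.4]) (my quick check; as with the other approaches, int() truncation can bite for some inputs, so test it with the values you expect from your users).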

Replicate IDL 'smooth' in Python 2.7

I have been trying to work out how to replicate IDL's smooth function in Python and I just can't get anything like the same results. (Disclaimer: It is probably 10 years since I touched this kind of mathematical problem so it has been dumped to make way for information like where to find the cheapest local fuel). I am trying to code this:
smooth(b,w,/nan)
where b is a 2D float array containing NANs (zeros - missing data - have also been converted to NAN).
From the IDL documents, it appears smooth uses a boxcar, so from scipy.ndimage.filters I have tried:
bsmooth = uniform_filter(b, w)
I am aware that there are some fundamental differences here:
the default edge behaviour from IDL is "the end points are copied from the original array to the result with no smoothing", whereas I don't seem to have the option to do this with the uniform filter.
Treatment of the NaN elements. In IDL, the /nan keyword seems to mean that where possible the NaN values will be filled by the result of the other points in the window; if there are no valid points to generate a result, the value is supplied by a MISSING keyword. I thought I could approximate this behaviour after the smoothing using scipy.interpolate's NearestNDInterpolator (thanks to the brilliant explanation by Alex on here: filling gaps on an image using numpy and scipy).
Here is my test array:
>>> b
array([[ 0.97599638,  0.93114936,  0.87070072,  0.5379253 ],
       [ 0.34873217,         nan,  0.40985891,  0.22407863],
       [        nan,         nan,         nan,  0.67532134],
       [        nan,         nan,  0.85441768,         nan]])
My answers bore not the SLIGHTEST resemblance to IDL, whether I use the /nan keyword or not.
IDL> smooth(b,2,/nan)
0.97599638 0.93114936 0.87070072 0.53792530
0.34873217 0.70728749 0.60817236 0.22407863
NaN 0.53766960 0.54091913 0.67532134
NaN NaN 0.85441768 NaN
IDL> smooth(b,2)
0.97599638 0.93114936 0.87070072 0.53792530
0.34873217 -NaN -NaN 0.22407863
-NaN -NaN -NaN 0.67532134
-NaN -NaN 0.85441768 NaN
I confess I find the scipy documentation rather sparse on detail, so I have no idea if I am really doing what I think I am doing. The fact that the two python approaches which I believed would both smooth the image give different answers suggests that things are not what I understood them to be.
>>>uniform_filter(b, 2)
array([[ 0.97599638, 0.95357287, 0.90092504, 0.70431301],
[ 0.66236428, nan, nan, nan],
[ nan, nan, nan, nan],
[ nan, nan, nan, nan]])
I thought it was a bit odd it was so empty, so I tried this with an array of 100 elements (still using a window of 2) and output the images. The results (first image is 'b', second is 'bsmooth') are not quite what I was hoping for.
Going back to the smaller array and following the examples in: http://scipy.github.io/old-wiki/pages/Cookbook/SignalSmooth which I thought would give the same output as uniform_filter, I tried:
>>> box = np.array([1,1,1,1])
>>> box = box.reshape(2,2)
>>> box
array([[1, 1],
[1, 1]])
>>> bsmooth = scipy.signal.convolve2d(b,box,mode='same')
>>> print bsmooth
[[ 0.97599638 1.90714574 1.80185008 1.40862602]
[ 1.32472855 nan nan 2.04256356]
[ nan nan nan nan]
[ nan nan nan nan]]
Obviously I have completely misunderstood the scipy functions, maybe even the IDL one. If anyone can help me to replicate the IDL smooth function as closely as possible, I would be extremely grateful. I am under considerable time pressure to get a solution for this that doesn't rely on IDL and I am tossing a coin to decide whether to code the function from scratch or develop a very contagious illness.
How can I perform the same smoothing in python?
First: please use matplotlib.pyplot.imshow with interpolation="none"; that's nicer to look at, and perhaps with a greyscale colormap.
So for your example: there is actually no convolution (filter) within scipy and numpy that treats NaN as missing values (they propagate them through the convolution). At least I've found none so far, and your boundary treatment is also (to my knowledge) not implemented. But the boundary could simply be replaced afterwards.
If you want to do convolution with NaN you can for example use astropy.convolution.convolve. There NaNs are interpolated using the kernel of your filter. But their convolution has some drawbacks as well: border handling like you want isn't implemented there either, your kernel must be of odd shape, and the sum of your kernel must not be zero (or very close to it).
For example:
from astropy.convolution import convolve
import numpy as np
array = np.random.uniform(10,100, (4,4))
array[1,1] = np.nan
kernel = np.ones((3,3))
convolve(array, kernel)
as an example an initial array of
array([[ 97.19514587, 62.36979751, 93.54811286, 30.23567842],
[ 51.02184613, nan, 46.14769821, 60.08088041],
[ 20.86482452, 42.39661484, 36.96961278, 96.89180175],
[ 45.54453509, 76.61274347, 46.44485141, 25.40985372]])
will become:
array([[ 266.9009961 , 406.59680717, 348.69637399, 230.01236989],
[ 330.16243546, 506.82785931, 524.95440336, 363.87378443],
[ 292.75477064, 422.31693304, 487.26826319, 311.94469828],
[ 185.41871792, 268.83318211, 324.72547798, 205.71611967]])
if you want to "normalize" it, astropy offers the normalize_kernel parameter:
convolved = convolve(array, kernel, normalize_kernel=True)
array([[ 29.58753936, 42.09982189, 49.31793529, 33.00203873],
[ 49.87040638, 65.67695002, 66.10447436, 40.44026448],
[ 52.51126383, 63.03914444, 60.85474739, 35.88011742],
[ 39.40188443, 46.82350749, 40.1380926 , 22.46090152]])
If you want to replace the "edge" values with the ones from the original array just replace them:
convolved[0,:] = array[0,:]
convolved[-1,:] = array[-1,:]
convolved[:,0] = array[:,0]
convolved[:,-1] = array[:,-1]
So that's what the existing packages offer (as far as I know). If you want to learn a bit of Cython or numba you can easily write your own convolution that is not much slower (only a factor of 2-10) than the numpy/scipy ones but does EXACTLY what you want without messing around.
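For example, a rough numba sketch of a NaN-ignoring boxcar mean that copies the edge values from the input (my own illustration of the idea, not a drop-in replacement for IDL smooth):
import numpy as np
from numba import njit

@njit
def nan_boxcar(arr, w):
    # NaN-aware boxcar mean of a 2D array; border pixels keep their
    # original values, loosely mimicking IDL's edge behaviour.
    out = arr.copy()
    half = w // 2
    rows, cols = arr.shape
    for r in range(half, rows - half):
        for c in range(half, cols - half):
            total = 0.0
            count = 0
            for i in range(r - half, r + half + 1):
                for j in range(c - half, c + half + 1):
                    v = arr[i, j]
                    if not np.isnan(v):
                        total += v
                        count += 1
            if count > 0:
                out[r, c] = total / count
    return out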
Since this is not something that is available in the python packages and because I saw the question asked several times during my research without satisfactory answers, here is how I solved the issue.
Provided is a test version of my function that I'm off to tidy up. I am sure there will be better ways to do the things I have done as I'm still fairly new to Python - please do recommend any appropriate changes.
Plots use autumn colourmap just because it allowed me to see the NaNs clearly.
My results:
IDL propagate
0.033369284 0.067915268 0.96602046 0.85623550
0.30435592 NaN NaN 100.00000
0.94065958 NaN NaN 0.90966976
0.018516513 0.044460904 0.051047217 NaN
python propagate
[[ 3.33692829e-02 6.79152655e-02 9.66020487e-01 8.56235492e-01]
[ 3.04355923e-01 nan nan 1.00000000e+02]
[ 9.40659566e-01 nan nan 9.09669768e-01]
[ 1.85165123e-02 4.44609040e-02 5.10472165e-02 nan]]
IDL replace
0.033369284 0.067915268 0.96602046 0.85623550
0.30435592 0.47452110 14.829881 100.00000
0.94065958 0.33833817 17.002417 0.90966976
0.018516513 0.044460904 0.051047217 NaN
python replace
[[ 3.33692829e-02 6.79152655e-02 9.66020487e-01 8.56235492e-01]
[ 3.04355923e-01 4.74521092e-01 1.48298812e+01 1.00000000e+02]
[ 9.40659566e-01 3.38338177e-01 1.70024175e+01 9.09669768e-01]
[ 1.85165123e-02 4.44609040e-02 5.10472165e-02 nan]]
My function:
#!/usr/bin/env python
# smooth.py
__version__ = 0.1
# Version 0.1 29 Feb 2016 ELH Test release

import numpy as np
import matplotlib.pyplot as mp

def Smooth(v1, w, nanopt):
    # v1 is the input 2D numpy array.
    # w is the width of the square window along one dimension
    # nanopt can be replace or propagate
    '''
    v1 = np.array(
        [[3.33692829e-02, 6.79152655e-02, 9.66020487e-01, 8.56235492e-01],
         [3.04355923e-01, np.nan        , 4.86013025e-01, 1.00000000e+02],
         [9.40659566e-01, 5.23314093e-01, np.nan        , 9.09669768e-01],
         [1.85165123e-02, 4.44609040e-02, 5.10472165e-02, np.nan        ]])
    w = 2
    '''
    mp.imshow(v1, interpolation='None', cmap='autumn')
    mp.show()

    # make a copy of the array for the output:
    vout = np.copy(v1)
    # If w is even, add one
    if w % 2 == 0:
        w = w + 1
    # get the size of each dim of the input:
    r, c = v1.shape
    # Assume that w, the width of the window, is always square.
    startrc = (w - 1)/2
    stopr = r - ((w + 1)/2) + 1
    stopc = c - ((w + 1)/2) + 1
    # For all pixels within the border defined by the box size,
    # calculate the average in the window.
    # There are two options:
    #   Ignore NaNs and replace the value where possible.
    #   Propagate the NaNs
    for col in range(startrc, stopc):
        # Calculate the window start and stop columns
        startwc = col - (w/2)
        stopwc = col + (w/2) + 1
        for row in range(startrc, stopr):
            # Calculate the window start and stop rows
            startwr = row - (w/2)
            stopwr = row + (w/2) + 1
            # Extract the window
            window = v1[startwr:stopwr, startwc:stopwc]
            if nanopt == 'replace':
                # If we're replacing NaNs, then select only the finite elements
                window = window[np.isfinite(window)]
            # Calculate the mean of the window
            vout[row, col] = np.mean(window)
    mp.imshow(vout, interpolation='None', cmap='autumn')
    mp.show()
    return vout
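For reference, calling it on the test array from the original question would look like this (note that the function as written also pops up two matplotlib windows):
v1 = np.array(
    [[0.97599638, 0.93114936, 0.87070072, 0.5379253 ],
     [0.34873217, np.nan,     0.40985891, 0.22407863],
     [np.nan,     np.nan,     np.nan,     0.67532134],
     [np.nan,     np.nan,     0.85441768, np.nan    ]])

print(Smooth(v1, 2, 'replace'))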

Solve an equation using a python numerical solver in numpy

I have an equation, as follows:
R - ((1.0 - np.exp(-tau))/(1.0 - np.exp(-a*tau))) = 0.
I want to solve for tau in this equation using a numerical solver available within numpy. What is the best way to go about this?
The values for R and a in this equation vary for different implementations of this formula, but are fixed at particular values when it is to be solved for tau.
In conventional mathematical notation, your equation is R = (1 - exp(-tau)) / (1 - exp(-a*tau)).
The SciPy fsolve function searches for a point at which a given expression equals zero (a "zero" or "root" of the expression). You'll need to provide fsolve with an initial guess that's "near" your desired solution. A good way to find such an initial guess is to just plot the expression and look for the zero crossing.
#!/usr/bin/python
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fsolve
# Define the expression whose roots we want to find
a = 0.5
R = 1.6
func = lambda tau : R - ((1.0 - np.exp(-tau))/(1.0 - np.exp(-a*tau)))
# Plot it
tau = np.linspace(-0.5, 1.5, 201)
plt.plot(tau, func(tau))
plt.xlabel("tau")
plt.ylabel("expression value")
plt.grid()
plt.show()
# Use the numerical solver to find the roots
tau_initial_guess = 0.5
tau_solution = fsolve(func, tau_initial_guess)
print "The solution is tau = %f" % tau_solution
print "at which the value of the expression is %f" % func(tau_solution)
You can rewrite the equation by substituting x = exp(-tau), which turns it into the polynomial equation R*x^a - x + 1 - R = 0. Note:
For integer a and non-zero R you will get a solutions (roots) in the complex plane;
There are analytical solutions for a = 0, 1, ..., 4 (see here);
So in general you may have one, multiple or no solution, and some or all of them may be complex values. You may easily throw scipy.optimize.root at this equation, but no numerical method will guarantee finding all the solutions.
To solve in the complex space:
import numpy as np
from scipy.optimize import root
def poly(xs, R, a):
    x = complex(*xs)
    err = R * x**a - x + 1 - R
    return [err.real, err.imag]

root(poly, x0=[0, 0], args=(1.2, 6))
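Once a root x has been found, the corresponding tau follows from the substitution x = exp(-tau), i.e. tau = -log(x) (this back-conversion is an addition for clarity, not part of the original answer):
sol = root(poly, x0=[0, 0], args=(1.2, 6))
x = complex(*sol.x)   # the root in the complex plane
tau = -np.log(x)      # invert the substitution x = exp(-tau)
print(x, tau)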