Updating weights for combining different models

I have the following problem:
I must identify whether a data point is an outlier or not (we don't have labels). I have several unsupervised models to identify outliers. I normalize each model's outlier score and then combine the scores via a weighted average. Since I have no information about the models' accuracy, I use the same weight for each model.
Now, suppose that I have a small fraction of the dataset that also has labels.
How can I update the weights according to this new information?
If you know of any resources on this, please share them, because I couldn't find any.
Thank you in advance.
I looked at some resources about Bayesian model averaging, but I don't know if it is the correct approach. I have also implemented an idea, but I'm not sure it is correct.
import numpy as np

def bayesian_update(anomaly, weight, prob):
    # posterior is proportional to P(anomaly | model) * prior model weight
    posterior = np.zeros(len(anomaly))
    for i in range(len(anomaly)):
        if anomaly[i] == 1:
            posterior[i] = prob[i] * weight
        else:
            posterior[i] = (1 - prob[i]) * weight
    return posterior

np.random.seed(0)
n_observations = 100
n_models = 4

# simulated outlier probabilities from each model
models_probs = np.random.rand(n_observations, n_models)
# pseudo-labels derived from the first model's probabilities
anomaly = np.where(models_probs[:, 0] > 0.5, 1, 0)

posterior_sum = np.zeros(n_models)
for i in range(n_models):
    posterior_sum[i] = np.sum(bayesian_update(anomaly, 0.25, models_probs[:, i]))

new_weight = posterior_sum / np.sum(posterior_sum)
print(new_weight)
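One possible way to use the small labeled subset (a hedged sketch, not taken from a specific reference): score each model on the labeled points, for example with ROC AUC, and renormalize those scores into combination weights. Here labeled_scores is assumed to be an (n_labeled, n_models) array of normalized outlier scores and labels the corresponding 0/1 ground truth:
from sklearn.metrics import roc_auc_score
import numpy as np

def update_weights(labeled_scores, labels):
    # one AUC per model on the labeled subset; models that separate the
    # labeled outliers better receive larger weights
    aucs = np.array([roc_auc_score(labels, labeled_scores[:, j])
                     for j in range(labeled_scores.shape[1])])
    return aucs / aucs.sum()

# weights = update_weights(labeled_scores, labels)
# combined_score = all_scores @ weights   # all_scores: (n_observations, n_models)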

PVLIB - DC Power From Irradiation - Simple Calculation

Dear pvlib users and devels,
I'm a researcher in computer science, not particularly expert in the simulation or modelling of solar panels. I'm interested in using pvlib since we are trying to simulate the behaviour of a small solar panel used for IoT applications; in particular, the panel specs are the following:
12.8% max efficiency, Vmp = 5.82V, size = 225 × 155 × 17 mm.
Before using pvlib, one of my collaborators wrote code that computes the irradiation directly from average monthly values calculated with PVWatt. I was not really satisfied, so we are starting to use pvlib.
In the old code, we have the power and current of the panel calculated as:
W = Irradiation * PanelSize(m^2) * Efficiency
A = W / Vmp
The irradiation in Madrid has been obtained with PVWatt, and this is what my collaborator used:
DIrradiance = (2030.0,2960.0,4290.0,5110.0,5950.0,7090.0,7200.0,6340.0,4870.0,3130.0,2130.0,1700.0)
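For reference, a small sketch of the old model's calculation from the two formulas above; the panel constants come from the specs quoted earlier, and the units of the result depend entirely on the units of the PVWatt values:
# old model: monthly power and current from the PVWatt values above
EFFICIENCY = 0.128           # 12.8% max efficiency
VMP = 5.82                   # V
AREA = 0.225 * 0.155         # m^2 (225 mm x 155 mm)

for month, irr in enumerate(DIrradiance, start=1):
    W = irr * AREA * EFFICIENCY
    A = W / VMP
    print(month, W, A)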
I'm trying to understand whether pvlib computes values similar to the ones above, as daily averages for each month, and what the production curve over a day looks like.
I wrote this to compare pvlib with our old model:
import math
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd
import pvlib
from pvlib.location import Location
def irradiance(day, m):
    DIrradiance = (2030.0, 2960.0, 4290.0, 5110.0, 5950.0,
                   7090.0, 7200.0, 6340.0, 4870.0, 3130.0, 2130.0, 1700.0)
    madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
    times = pd.date_range(start=dt.datetime(2015, m, day, 0, 0),
                          end=dt.datetime(2015, m, day, 23, 59),
                          freq='60min')
    spaout = pvlib.solarposition.spa_python(times, madrid.latitude, madrid.longitude)
    spaout = spaout.assign(cosz=pd.Series(np.cos(np.deg2rad(spaout['zenith']))))
    z = np.array(spaout['cosz'])
    return z.clip(0) * DIrradiance[m - 1]
madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start = dt.datetime(2015,8,15,00,00),
end = dt.datetime(2015,8,15,23,59),
freq='60min')
old = irradiance(15,8) # old model
new = madrid.get_clearsky(times) # pvlib irradiance
plt.plot(old,'r-') # compare them.
plt.plot(old/6.0,'y-') # old seems 6 times more..I do not know why
plt.plot(new['ghi'].values,'b-')
plt.show()
The code above computes the old irradiance using the zenith angle, and computes the GHI values using get_clearsky. I do not understand whether the values in ghi must also be multiplied by the cosine of the zenith or not. In any case, they are smaller by a factor of about 6. What I'd like to have in the end is the power and current output from the panel (DC) without any inverter; we are not really interested in modelling it exactly, but we would at least like a reasonable curve. We are able to capture the amperes produced by the panel, and we want to compare the values measured with the panel on the rooftop against the values calculated by pvlib.
Any help on this would be really appreciated. Thanks.
Sorry Will, I do not care much about my previous model since I'd like to move all the code to pvlib. I followed your suggestion and I'm using irradiance.total_irrad; the code now looks like this:
from pvlib import atmosphere, irradiance   # needed for relativeairmass and total_irrad below

madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start=dt.datetime(2015, 1, 1, 0, 0),
                      end=dt.datetime(2015, 1, 1, 23, 59),
                      freq='60min')
ephem_data = pvlib.solarposition.spa_python(times, madrid.latitude,
                                            madrid.longitude)
irrad_data = madrid.get_clearsky(times)
AM = atmosphere.relativeairmass(ephem_data['apparent_zenith'])
total = irradiance.total_irrad(40, 180,
                               ephem_data['apparent_zenith'], ephem_data['azimuth'],
                               dni=irrad_data['dni'], ghi=irrad_data['ghi'],
                               dhi=irrad_data['dhi'], airmass=AM,
                               surface_type='urban')
poa = total['poa_global'].values
Now that I know the irradiance on the POA, I want to compute the output in amperes. Is it just
(poa * PANEL_EFFICIENCY * AREA) / VOLT_OUTPUT ?
It's not clear to me how you arrived at your values for DIrradiance or what the units are, so I can't comment much on the discrepancies between the values. I'm guessing that it's some kind of monthly data since there are 12 values. If so, you'd need to calculate ~hourly pvlib irradiance data and then integrate it to check for consistency.
If your module will be tilted, you'll need to convert your ~hourly irradiance GHI, DNI, DHI values to plane of array (POA) irradiance using a transposition model. The irradiance.total_irrad function is the easiest way to do that.
The next steps depend on the IV characteristics of your module, the rest of the circuit, and how accurate you need the model to be.
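For the last question, a rough sketch along the lines of the formula the asker proposes: a constant-efficiency approximation that ignores the module's IV curve, temperature and low-light behaviour, so it only gives an indicative curve rather than an accurate model:
# rough DC estimate from POA irradiance with a constant efficiency
PANEL_EFFICIENCY = 0.128
AREA = 0.225 * 0.155        # m^2, from the panel size in the question
VMP = 5.82                  # V

power_w = poa * PANEL_EFFICIENCY * AREA    # W at each timestamp
current_a = power_w / VMP                  # A, assuming operation near Vmp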

How to generate Scikit-Learn Gaussian Process regression with 2D input, 1D output

I have been looking for the answer to my question for quite a while, with no luck so far :(. I will keep my question as simple as possible. For simplicity I only have a 2D input (it will eventually grow). Let's say I am using two variables (features: vehicle odometer reading, new car price) to predict the value of the car (target: old car price). How can I train sklearn.gaussian_process.GaussianProcessRegressor to predict what I am looking for?
import random
import numpy as np
from sklearn import gaussian_process

# X, y hold the training data: (odometer, new car price) -> old car price
X_train = np.array(X).reshape((-1, 2)).astype(int)
y_train = np.array(y).reshape(-1, 1).astype(int)
GPR = gaussian_process.GaussianProcessRegressor(normalize_y=False, n_restarts_optimizer=3)
GPR.fit(X_train, y_train)

# creating random points for testing the model
X_test_Odometer = np.linspace(0, 268000, 1000)[:, None]
X_test_Price = random.sample(range(5000, 13000), 1000)
X_test = np.column_stack((X_test_Odometer, X_test_Price)).astype(int)
GPR.predict(X_test)
This prediction does not work at all. I do not know whether I need to customize a kernel, and if so, I do not know how. I am new to scikit-learn and any help would be appreciated :)
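One possible starting point, not taken from the original thread: because the two features live on very different scales (odometer in the hundreds of thousands, price in the thousands), an anisotropic kernel with one length scale per dimension plus a noise term often behaves better than the default kernel. A minimal sketch, reusing X_train, y_train and X_test from above:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

# anisotropic RBF: a separate length scale per input dimension, plus white noise
kernel = ConstantKernel(1.0) * RBF(length_scale=[1e4, 1e3]) + WhiteKernel(noise_level=1.0)

gpr = GaussianProcessRegressor(kernel=kernel,
                               normalize_y=True,        # targets are far from zero mean
                               n_restarts_optimizer=5)
gpr.fit(X_train.astype(float), y_train.ravel())
y_pred, y_std = gpr.predict(X_test.astype(float), return_std=True)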

How to create a container in pymc3

I am trying to build a model for the likelihood function of a particular outcome of a Langevin equation (a Brownian particle in a harmonic potential).
Here is my model in pymc2 that seems to work:
https://github.com/hstrey/BayesianAnalysis/blob/master/Langevin%20simulation.ipynb
# define the model/function to be fitted
def model(x):
    t = pm.Uniform('t', 0.1, 20, value=2.0)
    A = pm.Uniform('A', 0.1, 10, value=1.0)

    @pm.deterministic(plot=False)
    def S(t=t):
        return 1 - np.exp(-4 * delta_t / t)

    @pm.deterministic(plot=False)
    def s(t=t):
        return np.exp(-2 * delta_t / t)

    path = np.empty(N, dtype=object)
    path[0] = pm.Normal('path_0', mu=0, tau=1 / A, value=x[0], observed=True)
    for i in range(1, N):
        path[i] = pm.Normal('path_%i' % i,
                            mu=path[i - 1] * s,
                            tau=1 / A / S,
                            value=x[i],
                            observed=True)
    return locals()

mcmc = pm.MCMC(model(x))
mcmc.sample(20000, 2000, 10)
The basic idea is that each point depends on the previous point in the chain (a Markov chain). By the way, x is an array of data, N is its length, and delta_t is the time step (0.01). Any idea how to implement this in pymc3? I tried:
# define the model/function for diffusion in a harmonic potential
DHP_model = pm.Model()
with DHP_model:
    t = pm.Uniform('t', 0.1, 20)
    A = pm.Uniform('A', 0.1, 10)
    S = 1 - pm.exp(-4 * delta_t / t)
    s = pm.exp(-2 * delta_t / t)
    path = np.empty(N, dtype=object)
    path[0] = pm.Normal('path_0', mu=0, tau=1 / A, observed=x[0])
    for i in range(1, N):
        path[i] = pm.Normal('path_%i' % i,
                            mu=path[i - 1] * s,
                            tau=1 / A / S,
                            observed=x[i])
Unfortunately the model crashes as soon as I try to run it. I tried some of the pymc3 examples (tutorial) on my machine and those work.
Thanks in advance. I am really hoping that the new samplers in pymc3 will help me with this model. I am trying to apply Bayesian methods to single-molecule experiments.
Rather than creating many individual normally-distributed 1-D variables in a loop, you can make a custom distribution (by extending Continuous) that knows the formula for computing the log likelihood of your entire path. You can bootstrap this likelihood formula off of the Normal likelihood formula that pymc3 already knows. See the built-in AR1 class for an example.
Since your particle follows the Markov property, your likelihood looks like
import theano.tensor as T

def logp(path):
    now = path[1:]
    prev = path[:-1]
    loglik_first = pm.Normal.dist(mu=0., tau=1. / A).logp(path[0])
    loglik_rest = T.sum(pm.Normal.dist(mu=prev * ss, tau=1. / A / S).logp(now))
    loglik_final = loglik_first + loglik_rest
    return loglik_final
I'm guessing that you want to draw a value for ss at every time step, in which case you should make sure to specify ss = pm.exp(..., shape=len(x)-1), so that prev*ss in the block above gets interpreted as element-wise multiplication.
Then you can just specify your observations with
path = MyLangevin('path', ..., observed=x)
This should run much faster.
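A minimal sketch of what such a custom distribution could look like, reusing the logp above and assuming delta_t and x are defined as in the question; the class name MyLangevin and the way ss, A and S are passed in are illustrative, not a fixed API:
import numpy as np
import pymc3 as pm
import theano.tensor as T
from pymc3.distributions.continuous import Continuous

class MyLangevin(Continuous):
    # joint log likelihood of the whole path, using its Markov structure
    def __init__(self, ss, A, S, *args, **kwargs):
        super(MyLangevin, self).__init__(*args, **kwargs)
        self.ss = ss
        self.A = A
        self.S = S

    def logp(self, path):
        now = path[1:]
        prev = path[:-1]
        loglik_first = pm.Normal.dist(mu=0., tau=1. / self.A).logp(path[0])
        loglik_rest = T.sum(pm.Normal.dist(mu=prev * self.ss,
                                           tau=1. / self.A / self.S).logp(now))
        return loglik_first + loglik_rest

with pm.Model():
    t = pm.Uniform('t', 0.1, 20)
    A = pm.Uniform('A', 0.1, 10)
    S = 1 - pm.exp(-4 * delta_t / t)
    ss = pm.exp(-2 * delta_t / t)
    path = MyLangevin('path', ss=ss, A=A, S=S, observed=x)
    trace = pm.sample(2000)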
Since I did not see an answer to my question, let me answer it myself. I came up with the following solution:
# now let's model this data using pymc
# define the model/function for diffusion in a harmonic potential
DHP_model = pm.Model()
with DHP_model:
    D = pm.Gamma('D', mu=mu_D, sd=sd_D)
    A = pm.Gamma('A', mu=mu_A, sd=sd_A)
    S = 1.0 - pm.exp(-2.0 * delta_t * D / A)
    ss = pm.exp(-delta_t * D / A)
    path = pm.Normal('path_0', mu=0.0, tau=1 / A, observed=x[0])
    for i in range(1, N):
        path = pm.Normal('path_%i' % i,
                         mu=path * ss,
                         tau=1.0 / A / S,
                         observed=x[i])
    start = pm.find_MAP()
    print(start)
    trace = pm.sample(100000, start=start)
Unfortunately, at N=50 this code takes anywhere between 6 hours and 2 days to compile. I am running on a pretty fast PC (24 GB RAM) running Ubuntu. I tried using the GPU but that runs slightly slower. I suspect memory problems, since it uses 99.8% of the memory when running. I tried the same calculation with Stan and it only takes 2 minutes to run.

How to sample independently with pymc3

I am working with a simple bivariate normal model with a somewhat unconventional prior. The main issue I have is that my posteriors are inconsistent from one run to the next, which I'm guessing is related to an issue of high dependence between consecutive samples. Here are my specific questions.
What is the best way to get N independent samples? At the moment, I've been calling sample() to get a big chain (e.g. length 10,000) and then taking every 100th sample starting at 1,000. But looking now at an autocorrelation profile of one of the parameters, it looks like I need to take at least every 500th sample! (I could also use mutual information to get a better idea of dependence between lags.)
I've been following the fitting procedure described in the stochastic volatility example in the pymc3 tutorial. In particular I first find the MAP, then use it to generate a NUTS() object, then take a short sample, then use that to generate another NUTS() object, using gamma=0.25 (???), then finally get my big sample. I have no idea whether this is appropriate or whether I need the gamma=0.25.
Also, in that same example, there are testvals for the Exponential distribution. I don't know if I need these. (What is wrong with the default use of the mean?)
Here is the actual model I'm using.
import pymc3 as pymc
import numpy as np
import theano.tensor as th
from pymc3.distributions.continuous import Gamma, Uniform, Normal, Bounded
from pymc3.distributions.multivariate import MvNormal
from pymc3.model import Deterministic

data = np.random.randn(3000, 2) / 300  # I have actual data!

with pymc.Model():
    tau = Gamma('tau', alpha=2, beta=1 / 20000)
    sigma = Deterministic('sigma', 1 / th.sqrt(tau))
    corr = Uniform('corr', lower=0, upper=1)
    alpha_sig = Deterministic('alpha_sig', sigma / 50)
    alpha_post = Normal('alpha_post', mu=0, sd=alpha_sig)
    alpha_pre = Bounded(
        'alpha_pre', Normal, alpha_post, np.Inf, mu=0, sd=alpha_sig)
    corr_inv = th.stack([th.stack([1, -corr]),
                         th.stack([-corr, 1])]) / (1 - th.sqr(corr))
    MvNormal(
        'data', mu=th.stack([alpha_post, alpha_pre]),
        tau=tau * corr_inv, observed=data)

    map_ = pymc.find_MAP()
    step1 = pymc.NUTS(scaling=map_)
    trace1 = pymc.sample(1000, step=step1)
    step2 = pymc.NUTS(scaling=trace1[-1], gamma=0.25)
    trace2 = pymc.sample(10000, step=step2, start=trace1[-1])
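For reference, the burn-in and thinning described earlier can be expressed directly on the trace produced above; a small sketch (the slice values are just the ones mentioned in the question):
# drop the first 1,000 samples as burn-in, then keep every 100th sample
thinned = trace2[1000::100]

# inspect autocorrelation to judge how aggressive the thinning needs to be
pymc.autocorrplot(trace2, varnames=['corr'])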
I'm not sure what you're doing with the complex prior structure you have set up but I think there is something wrong there.
I simplified the model to:
import pymc3 as pymc
import numpy as np
import theano.tensor as th
from pymc3.distributions.continuous import Gamma, Uniform, Normal, Bounded
from pymc3.distributions.multivariate import MvNormal
from pymc3.model import Deterministic

data = np.random.randn(3000, 2)  # I have actual data!

with pymc.Model():
    corr = Uniform('corr', lower=0, upper=1)
    corr_inv = th.stack([th.stack([1, -corr]),
                         th.stack([-corr, 1])]) / (1 - th.sqr(corr))
    mu = Normal('mu', mu=0, sd=1, shape=2)
    MvNormal('data',
             mu=mu,
             tau=corr_inv,
             observed=data)

    map_ = pymc.find_MAP()
    step1 = pymc.NUTS(scaling=map_)
    trace1 = pymc.sample(1000, step=step1)
    step2 = pymc.NUTS(scaling=trace1[-1])
    trace2 = pymc.sample(10000, step=step2, start=trace1[-1])
This has great convergence. I think you can also just drop the gamma parameter.

Django: How do I store a geo point in the database

I need the correct datatype for geo points.
I will get and display them with the Google Maps API, so the format is like
42.761819,11.104863
41.508577,-101.953125
Use case:
user clicks on the map
django saves this point with additional data
on the next visit django displays these points on the map
So, no distances between points or other such hacks.
DB: postgres 8
Django: 1.4
Check out GeoDjango and see if it helps you. If your stack is configured to run GeoDjango, you can do this.
Your model will look like this:
from django.contrib.gis.db import models

class LocationPoint(models.Model):
    point = models.PointField(srid=4326, dim=3)
    accuracy = models.FloatField(default=0.0)
    objects = models.GeoManager()
To save the point to the database all you will have to do is
from django.contrib.gis.geos import Point
point = Point(x=12.734534, y=77.2342, z=0, srid=4326)
location = LocationPoint()
location.point = point
location.save()
GeoDjango gives you extended abilities to do geospatial queries, which you might find useful in the near future, such as finding the distance between points or the nearest locations around a point.
Here's the link to GeoDjango
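For example, here are a couple of hedged sketches of such queries against the LocationPoint model above (they assume a working PostGIS setup; names follow the snippet above):
from django.contrib.gis.geos import Point
from django.contrib.gis.measure import D

reference = Point(x=12.734534, y=77.2342, srid=4326)

# all stored points within 5 km of the reference point
nearby = LocationPoint.objects.filter(point__distance_lte=(reference, D(km=5)))

# stored points ordered by distance from the reference point, nearest first
nearest = LocationPoint.objects.distance(reference).order_by('distance')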
From the Django documentation about DecimalField:
DecimalField.max_digits
The maximum number of digits allowed in the number. Note that this number must be greater than or equal to decimal_places.
DecimalField.decimal_places
The number of decimal places to store with the number.
which refers to the Python Decimal type.
To make a good choice of data type and precision you should consider:
the minimum and maximum possible values (latitude ranges from -90 to +90 degrees, longitude from -180 to +180 degrees);
the accuracy (decimal_places) you want; please note that it has an impact on the usable zoom level on Google Maps.
By the way, for better understanding, it is good to know how the calculation is done (Python code):
def deg_to_dms(deg):
    d = int(deg)
    md = abs(deg - d) * 60
    m = int(md)
    sd = (md - m) * 60
    return [d, m, sd]

def decimal(deg, min, sec):
    if deg < 0:
        dec = -1.0 * deg + 1.0 * min / 60.0 + 1.0 * sec / 3600.0
        return -1.0 * dec
    else:
        dec = 1.0 * deg + 1.0 * min / 60.0 + 1.0 * sec / 3600.0
        return dec
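A short usage example of the two helpers above (the values are just illustrative):
# decimal degrees -> [degrees, minutes, seconds] and back
dms = deg_to_dms(42.761819)              # [42, 45, 42.5484]
deg = decimal(dms[0], dms[1], dms[2])    # 42.761819 again (up to rounding)
print(dms, deg)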
It looks like you're going to be storing a latitude and a longitude. I would go with a DecimalField for this, and store each number separately.
I use longitude and latitude in my django setup.
My model includes:
long_position = models.DecimalField(max_digits=8, decimal_places=3)
lat_position = models.DecimalField(max_digits=8, decimal_places=3)
For more precision you may want the decimal_places to be more.
When you want to display it with the Google Maps API, you would reference your model and write Python code to produce output like this:
output = some_long_position + "," + some_lati_position
I'm using this in my model:
latitude = models.DecimalField(max_digits=11, decimal_places=7,null=True,blank=True)
longitude = models.DecimalField(max_digits=11, decimal_places=7,null=True,blank=True)