How do I determine or better specify the horizon for present and past predictions? - pvlib

I have a conceptual problem with pvlib's forecasts: when I ask for "past predictions", I do not know what the temporal horizon of the prediction is. For actual future predictions it is a little more obvious: naively, I would just subtract the present time from the timestamp of the returned prediction. However, if the underlying model data are only refreshed at hourly or 6-hourly intervals (model dependent), it seems I would have to add that uncertainty to the horizon, so I am still unsure.
For past predictions, I just have no idea what the horizon is. How can this be determined?
This question applies to pvlib-python's standard way of getting forecast data, and I think it also applies to the special script for getting data for predictions further into the past.
Any help would be appreciated in understanding this situation.
To try to make this question more concrete I am including this bit of code taken from forecast_to_power.ipynb with the start and end times modified to be in the past:
# built-in python modules
import datetime
import inspect
import os
# scientific python add-ons
import numpy as np
import pandas as pd
# plotting stuff
# first line makes the plots appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
# finally, we import the pvlib library
from pvlib import solarposition,irradiance,atmosphere,pvsystem
from pvlib.forecast import GFS, NAM, NDFD, RAP, HRRR
# Choose a location.
# Tucson, AZ
latitude = 32.2
longitude = -110.9
tz = 'US/Mountain'
surface_tilt = 30
surface_azimuth = 180 # pvlib uses 0=North, 90=East, 180=South, 270=West convention
albedo = 0.2
# for this example, let's predict into the past:
start = pd.Timestamp(datetime.date.today(), tz=tz) - pd.Timedelta(days=14) # 14 days ago
end = start + pd.Timedelta(days=7) # 7 days from start
fm = GFS()
forecast_data = fm.get_processed_data(latitude, longitude, start, end)
                            temp_air  wind_speed         ghi         dni         dhi  total_clouds  low_clouds  mid_clouds  high_clouds
2019-02-25 06:00:00-07:00   6.581512    1.791610    0.000000    0.000000    0.000000          33.0         0.0         0.0         33.0
2019-02-25 09:00:00-07:00   4.832214    0.567790  392.833659  668.164855  121.831040           0.0         0.0         0.0          0.0
2019-02-25 12:00:00-07:00   3.409973    0.860611  794.120954  910.658669  118.492918           0.0         0.0         0.0          0.0
2019-02-25 15:00:00-07:00   6.841797    0.942555  529.425232  515.727013  222.689391          22.0         0.0         0.0         22.0
2019-02-25 18:00:00-07:00  24.458038    0.466084   11.339769    0.000000   11.339769          52.0         0.0         0.0         52.0
What is the temporal horizon for this back prediction? Can I adjust it? If so how?

For actual future predictions it is a little more obvious, naively I would just subtract the present time from the timestamp of the prediction returned
Yes, this is correct for true predictions. For past predictions, you should define the horizon in a way that is consistent with your ability to make true predictions.
NCEP maintains a model status page that details the typical times at which the weather model data is available on its servers. Each model has a different delay between its initialization time and its forecast availability.
The Solar Forecast Arbiter definitions document might also help.
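To make "consistent with your ability to make true predictions" a bit more concrete, here is a minimal sketch of one way to assign a horizon to a past prediction. It is not part of pvlib. The helper name `latest_available_run`, the 6-hour model cycle, and the 5-hour availability delay are all assumptions for illustration; the real delay for a given model should be taken from the NCEP status page.

```python
import pandas as pd


def latest_available_run(issue_time, cycle_hours=6,
                         availability_delay=pd.Timedelta(hours=5)):
    """Initialization time of the newest model run usable at issue_time.

    Hypothetical helper: cycle_hours and availability_delay are assumed
    values, not properties reported by pvlib or NCEP.
    """
    issue_utc = issue_time.tz_convert("UTC")
    # A run is usable only once the availability delay has passed.
    newest_possible_init = issue_utc - availability_delay
    # Model runs start on a fixed UTC grid (00, 06, 12, 18 for a 6 h cycle).
    return newest_possible_init.floor(f"{cycle_hours}h")


# Pretend the forecast was issued at this moment in the past.
issue_time = pd.Timestamp("2019-02-24 11:00", tz="US/Mountain")
# Timestamp of one row of the returned forecast data.
valid_time = pd.Timestamp("2019-02-25 06:00", tz="US/Mountain")

init_time = latest_available_run(issue_time)
horizon = valid_time - issue_time        # horizon as seen by the forecast user
model_lead_time = valid_time - init_time  # lead time relative to the model run

print(init_time, horizon, model_lead_time)
```

The point of separating `horizon` from `model_lead_time` is that the first depends on when you would have made the prediction, while the second depends on which model run you would have been able to download at that time.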


Pvlib / Bird1984: North-facing element shows negative Irradiance

When using pvlib (but also the spectrl2 implementation provided by NREL), I obtain negative Irradiance for a north-facing panel.
Is this expected behaviour? Should the spectrum simply be cut at zero?
Added example code based on the tutorial below:
## Using PV Lib
from pvlib import spectrum, solarposition, irradiance, atmosphere
import pandas as pd
import matplotlib.pyplot as plt
# assumptions from the technical report:
lat = 49.88
lon = 8.63
tilt = 45
azimuth = 0 # North = 0
pressure = 101300 # sea level, roughly
water_vapor_content = 0.5 # cm
tau500 = 0.1
ozone = 0.31 # atm-cm
albedo = 0.2
times = pd.date_range('2021-11-30 8:00', freq='h', periods=6, tz="Europe/Berlin") # , tz='Etc/GMT+9'
solpos = solarposition.get_solarposition(times, lat, lon)
aoi = irradiance.aoi(tilt, azimuth, solpos.apparent_zenith, solpos.azimuth)
# The technical report uses the 'kasten1966' airmass model, but later
# versions of SPECTRL2 use 'kastenyoung1989'. Here we use 'kasten1966'
# for consistency with the technical report.
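relative_airmass = atmosphere.get_relative_airmass(solpos.apparent_zenith,
                                                   model='kasten1966')
spectra = spectrum.spectrl2(
    apparent_zenith=solpos.apparent_zenith,
    aoi=aoi,
    surface_tilt=tilt,
    ground_albedo=albedo,
    surface_pressure=pressure,
    relative_airmass=relative_airmass,
    precipitable_water=water_vapor_content,
    ozone=ozone,
    aerosol_turbidity_500nm=tau500,
)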
plt.plot(spectra['wavelength'], spectra['poa_global'])
plt.xlim(200, 2700)
# plt.ylim(0, 1.8)
plt.title(r"2021-11-30, Darmstadt, $\tau=0.1$, Wv=0.5 cm")
plt.ylabel(r"Irradiance ($W m^{-2} nm^{-1}$)")
plt.xlabel(r"Wavelength ($nm$)")
time_labels = times.strftime("%H:%M %p")
labels = [
    "AM {:0.02f}, Z{:0.02f}, {}".format(*vals)
    for vals in zip(relative_airmass, solpos.apparent_zenith, time_labels)
]
plt.legend(labels)
plt.show()
No, this is not expected behavior. I suspect the issue is caused by improper handling of angle-of-incidence values greater than 90 degrees, which is essentially the same problem (in a different function) that has been discussed before.
It's unfortunate that the reference implementation from NREL has the problem too (perhaps when the model was originally designed, nobody could conceive of a panel facing away from the sun!), but I think the pvlib implementation should be fixed regardless. I encourage you to file a bug report with pvlib.
In the meantime, I think you can resolve the issue in your own code by adding a line like aoi[aoi > 90] = 90 before passing aoi to spectrum.spectrl2, although be careful with this if you end up using aoi for other purposes later in the script. I would be interested to hear whether the resulting spectra are consistent with your expectations.
Edit for posterity: a GitHub issue has been opened.
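If you want to keep the original aoi values around for other uses, a non-mutating variant of the workaround could look like the sketch below. The aoi numbers are made up for illustration and stand in for the output of irradiance.aoi; only the clipped copy would be passed on to spectrum.spectrl2.

```python
import pandas as pd

# Made-up angle-of-incidence values in degrees; values above 90 mean the
# sun is behind the panel.
aoi = pd.Series([75.0, 88.5, 93.2, 101.7, 120.0])

# Clip a copy instead of assigning in place, so `aoi` itself is untouched.
aoi_clipped = aoi.clip(upper=90)

print(aoi_clipped.tolist())
```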

PVLIB - DC Power From Irradiation - Simple Calculation

Dear pvlib users and developers,
I'm a researcher in computer science, not particularly expert in simulating or modelling solar panels. I'm interested in using pvlib because we are trying to simulate the operation of a small solar panel used for IoT applications. The panel specs are the following:
12.8% max efficiency, Vmp = 5.82 V, size = 225 × 155 × 17 mm.
Before using pvlib, one of my collaborators wrote code that computes the irradiation directly from average monthly values calculated with PVWatts.
I was not really satisfied with it, so we are starting to use pvlib.
In the old code, the power and current of the panel are calculated as:
W = Irradiation * PanelSize(m^2) * Efficiency
A = W / Vmp
The irradiation for Madrid was obtained with PVWatts, and this is what my collaborator used:
DIrradiance = (2030.0,2960.0,4290.0,5110.0,5950.0,7090.0,7200.0,6340.0,4870.0,3130.0,2130.0,1700.0)
I'm trying to understand whether pvlib computes values similar to the ones above, as averages over a day for each month, and what the production curve over a day looks like.
I wrote this to compare pvlib with our old model:
import math
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd
import pvlib
from pvlib.location import Location
def irradiance(day, m):
    # monthly values obtained from PVWatts (same tuple as above)
    DIrradiance = (2030.0, 2960.0, 4290.0, 5110.0, 5950.0, 7090.0,
                   7200.0, 6340.0, 4870.0, 3130.0, 2130.0, 1700.0)
    madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
    times = pd.date_range(start=dt.datetime(2015, m, day, 0, 0),
                          end=dt.datetime(2015, m, day, 23, 59),
                          freq='60min')  # end and freq are assumed; the call was cut off
    spaout = pvlib.solarposition.spa_python(times, madrid.latitude, madrid.longitude)
    spaout = spaout.assign(cosz=pd.Series(np.cos(np.deg2rad(spaout['zenith']))))
    z = np.array(spaout['cosz'])
    return z.clip(0) * DIrradiance[m - 1]

madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start=dt.datetime(2015, 8, 15, 0, 0),
                      end=dt.datetime(2015, 8, 15, 23, 59),
                      freq='60min')  # freq is assumed; the call was cut off
old = irradiance(15, 8)  # old model
new = madrid.get_clearsky(times)  # pvlib irradiance
plt.plot(old, 'r-')  # compare them.
plt.plot(old / 6.0, 'y-')  # old seems 6 times more..I do not know why
The code above computes the old irradiance using the zenith angle, and computes the GHI values using the clear-sky model. I do not understand whether the GHI values must also be multiplied by the cosine of the zenith angle or not. In any case, they are smaller by a factor of 6. What I'd like to have in the end is the DC power and current output of the panel, without any inverter. We are not really interested in modelling it exactly, but we would at least like a reasonable curve. We can measure the current (in amperes) produced by the panel, and we want to compare the measurements taken with the panel on the rooftop against the values calculated by pvlib.
Any help on this would be really appreciated. Thanks
Sorry Will, I do not care much about my previous model, since I'd like to move all the code to pvlib. I followed your suggestion and I'm using irradiance.total_irrad; the code now looks like this:
madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start=dt.datetime(2015,1,1,00,00),
ephem_data = pvlib.solarposition.spa_python(times, madrid.latitude,
                                            madrid.longitude)
irrad_data = madrid.get_clearsky(times)
AM = atmosphere.relativeairmass(ephem_data['apparent_zenith'])
total = irradiance.total_irrad(40, 180,
ephem_data['apparent_zenith'], ephem_data['azimuth'],
dni=irrad_data['dni'], ghi=irrad_data['ghi'],
dhi=irrad_data['dhi'], airmass=AM,
poa = total['poa_global'].values
Now, I know the irradiance on POA, and I want to compute the output in Ampere: It is just
It's not clear to me how you arrived at your values for DIrradiance or what the units are, so I can't comment much on the discrepancies between the values. I'm guessing that it's some kind of monthly data since there are 12 values. If so, you'd need to calculate ~hourly pvlib irradiance data and then integrate it to check for consistency.
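As a rough illustration of that integration step, the sketch below turns an hourly irradiance series (W/m²) into a daily total (Wh/m²). The GHI curve is a synthetic half-sine, not pvlib output, and the guess that the DIrradiance values are daily totals in Wh/m² is only an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic hourly GHI for one day: a half-sine between 07:00 and 19:00
# peaking at 900 W/m^2. This stands in for an hourly clear-sky GHI column.
times = pd.date_range("2015-08-15", periods=24, freq="h", tz="Europe/Madrid")
hours = np.arange(24)
ghi = pd.Series(np.clip(900.0 * np.sin(np.pi * (hours - 7) / 12), 0, None),
                index=times)

# Integrate power density over time: W/m^2 * h = Wh/m^2.
step_hours = 1.0
daily_wh_per_m2 = ghi.sum() * step_hours

print(daily_wh_per_m2, daily_wh_per_m2 / ghi.max())
```

A daily total is several times larger than the midday peak in W/m², so comparing a daily energy value against an instantaneous GHI value will always show a large factor between the two.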
If your module will be tilted, you'll need to convert your ~hourly irradiance GHI, DNI, DHI values to plane of array (POA) irradiance using a transposition model. The irradiance.total_irrad function is the easiest way to do that.
The next steps depend on the IV characteristics of your module, the rest of the circuit, and how accurate you need the model to be.
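If a very simple model is good enough, the formulas from the question can be applied directly to POA irradiance. The sketch below uses the panel numbers given in the question (12.8% efficiency, Vmp = 5.82 V, 225 × 155 mm) and made-up POA values. Treating the efficiency as constant and the panel as always operating at Vmp are simplifying assumptions, so expect only a rough curve.

```python
import numpy as np

# Panel parameters from the question.
panel_area_m2 = 0.225 * 0.155   # 225 mm x 155 mm
efficiency = 0.128              # max efficiency, assumed constant here
v_mp = 5.82                     # volts

# Made-up plane-of-array irradiance samples in W/m^2.
poa = np.array([0.0, 200.0, 600.0, 1000.0])

# W = Irradiance * PanelSize * Efficiency, A = W / Vmp (as in the question).
power_w = poa * panel_area_m2 * efficiency
current_a = power_w / v_mp

print(power_w, current_a)
```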

Remove weekends in finance plots with volume overlay [duplicate]

I've been having some difficulty with Matplotlib's finance charting. It seems like their candlestick charts work best with daily data, and I am having a hard time making them work with intraday (every 5 minutes, between 9:30 and 4 pm) data.
I have pasted sample data in pastebin. The top is what I get from the database, and the bottom is tupled with the date formatted into an ordinal float for use in Matplotlib.
Link to sample data
When I draw my charts there are huge gaps in it, the axes suck, and the zoom is equally horrible.
How do I make a nice readable graph out of this data? My ultimate goal is to get a chart that looks remotely like this:
The data points can be in various increments from 5 minutes to 30 minutes.
I have also made a Pandas dataframe of the data, but I am not sure if pandas has candlestick functionality.
If I understand correctly, one of your major concerns is the gaps between the daily data.
To get rid of them, one method is to artificially 'evenly space' your data (but of course you will lose any intraday time information).
Doing it this way, you will be able to obtain a chart that looks like the one you have proposed as an example.
The commented code and the resulting graph are below.
import numpy as np
import matplotlib.pyplot as plt
import datetime
from matplotlib.finance import candlestick  # matplotlib.finance was removed in matplotlib 2.2
from matplotlib.dates import num2date
# data in a text file, 5 columns: time, opening, close, high, low
# note that I'm using the time you formatted into an ordinal float
data = np.loadtxt('finance-data.txt', delimiter=',')
# determine number of days and create a list of those days
ndays = np.unique(np.trunc(data[:,0]), return_index=True)
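xdays = []
for n in np.arange(len(ndays[0])):
    # assumed loop body: label each day with the ISO date of its first sample
    xdays.append(datetime.date.isoformat(num2date(data[ndays[1], 0][n])))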
# creation of new data by replacing the time array with equally spaced values.
# this will allow to remove the gap between the days, when plotting the data
data2 = np.hstack([np.arange(data[:,0].size)[:, np.newaxis], data[:,1:]])
# plot the data
fig = plt.figure(figsize=(10, 5))
ax = fig.add_axes([0.1, 0.2, 0.85, 0.7])
# customization of the axis
ax.tick_params(axis='both', direction='out', width=2, length=8,
labelsize=12, pad=8)
# set the ticks of the x axis only when starting a new day
ax.set_xticks(data2[ndays[1], 0])
ax.set_xticklabels(xdays, rotation=45, horizontalalignment='right')
ax.set_ylabel('Quote ($)', size=20)
ax.set_ylim([177, 196])
candlestick(ax, data2, width=0.5, colorup='g', colordown='r')
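The 'evenly spaced' trick can be checked on its own, without any plotting. The sketch below uses four made-up rows spread over two days (same column order as the text file above) and shows how the time column is replaced by a running index and where the day ticks end up.

```python
import numpy as np

# Made-up intraday samples: ordinal-float time, open, close, high, low.
data = np.array([
    [735000.40, 10.0, 10.5, 10.6, 9.9],
    [735000.50, 10.5, 10.2, 10.7, 10.1],
    [735001.40, 10.3, 10.8, 10.9, 10.2],
    [735001.50, 10.8, 10.6, 11.0, 10.5],
])

# Index of the first sample of each calendar day.
days, first_idx = np.unique(np.trunc(data[:, 0]), return_index=True)

# Replace the time column by 0, 1, 2, ... so that nights and weekends
# take up no space on the x axis.
data2 = np.hstack([np.arange(len(data))[:, np.newaxis], data[:, 1:]])

# x positions at which a new day starts; these become the tick locations.
tick_positions = data2[first_idx, 0]

print(tick_positions)
```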
I got tired of matplotlib's (and plotly's) bad performance and the lack of features like the ones you request, so I implemented a library of my own. Here's how that works:
import finplot as fplt
import yfinance
df = yfinance.download('AAPL')
fplt.candlestick_ochl(df[['Open', 'Close', 'High', 'Low']])
Not only are days on which the exchange is closed left out automatically, but it also has better performance and a nicer API. For something that more closely resembles what you're ultimately looking for:
import finplot as fplt
import yfinance
symbol = 'AAPL'
df = yfinance.download(symbol)
ax = fplt.create_plot(symbol)
fplt.candlestick_ochl(df[['Open', 'Close', 'High', 'Low']], ax=ax)
fplt.plot(df['Close'].rolling(200).mean(), ax=ax, legend='SMA 200')
fplt.plot(df['Close'].rolling(50).mean(), ax=ax, legend='SMA 50')
fplt.plot(df['Close'].rolling(20).mean(), ax=ax, legend='SMA 20')
fplt.volume_ocv(df[['Open', 'Close', 'Volume']], ax=ax.overlay())

Python 2.7 Not plotting extremes based on slope in pandas dataframe

Basically, I want to not plot extremes in my graph. I thought doing this based on the slope of the graph would be a good idea, but for some reason I keep getting the error that the dates on my x-axis do not exist (DataFrame has no attribute Datumtijd). (Edit: Removed file location as question has been answered)
from pylab import *
import matplotlib.pyplot as plt
import matplotlib.dates as pld
%matplotlib inline
import pandas as pd
from pandas import DataFrame
pbn135 = pd.read_csv('3873_135.csv', parse_dates=[0], index_col = 0, dayfirst = True, delimiter = ';', usecols = ['Datumtijd','DisplayWaarde'])
for i in range(len(pbn135)):
    slope = (pbn135.DisplayWaarde[i+1]-pbn135.DisplayWaarde[i])/(pbn135.Datumtijd[i+1]-pbn135.Datumtijd[i])
Because of index_col = 0, Datumtijd becomes the index of the DataFrame rather than a column, which is why pbn135.Datumtijd does not exist; use pbn135.index instead. In addition, subtracting two datetimes gives a timedelta, and a number cannot be divided by a timedelta directly. Converting each datetime to a number first works. This is usually done by calculating the total seconds since a reference date (e.g. 1 Jan 2015).
To do this, import datetime from datetime, set a reference date datetime(2015,1,1), and calculate the seconds with total_seconds().
However, this does create a slope where the interval is in seconds and not the interval of your datetime. If anyone knows how to fix that without manually entering a division, please let us know.
from datetime import datetime
for i in range(len(pbn135) - 1):  # stop one row early so that i+1 stays in range
    slope = (pbn135.pbn73[i+1]-pbn135.pbn73[i])/((pbn135.index[i+1]-datetime(2015,1,1)).total_seconds()-(pbn135.index[i]-datetime(2015,1,1)).total_seconds())
    print slope
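One possible way to avoid both the loop and the hand-written conversion to seconds is to let pandas do the differencing and to divide the time step by a Timedelta of the unit you want. This is only a sketch in Python 3 with made-up data: the column name DisplayWaarde is taken from the question, and the threshold `max_slope` is a hypothetical value.

```python
import pandas as pd

# Made-up hourly measurements with one spike.
idx = pd.to_datetime(["2015-01-01 00:00", "2015-01-01 01:00",
                      "2015-01-01 02:00", "2015-01-01 03:00"])
df = pd.DataFrame({"DisplayWaarde": [1.0, 1.5, 9.0, 2.0]}, index=idx)

# Time step expressed in hours: Timedelta / Timedelta gives a plain float.
dt_hours = df.index.to_series().diff() / pd.Timedelta(hours=1)

# Slope in "value units per hour"; the first row has no predecessor (NaN).
slope = df["DisplayWaarde"].diff() / dt_hours

# Keep only rows whose slope magnitude stays below a hypothetical threshold.
max_slope = 2.0
filtered = df[slope.abs().fillna(0) <= max_slope]

print(slope.tolist())
print(filtered)
```

Note that a pure slope filter also drops the first normal point after a spike (the steep drop back down), so it may need a second pass or a different criterion depending on the data.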

scikit-learn PCA doesn't have 'score' method

I am trying to identify the type of noise based on that article:
Model selection with Probabilistic (PCA) and Factor Analysis (FA)
I am using scikit-learn-0.14.1.win32-py2.7 on win8 64bit
I know that it refers to version 0.15; however, the version 0.14 documentation mentions that the score method is available for PCA, so I guess it should normally work:
The problem is that no matter which PCA I use with *cross_val_score*, I always get a type error saying that the estimator PCA does not have a score method:
*TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator PCA(copy=True, n_components=None, whiten=False) does not.*
Any ideas why this is happening?
Many thanks in advance
X has 1000 samples of 40 features
here is a portion of the code:
import numpy as np
import csv
from scipy import linalg
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.cross_validation import cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.covariance import ShrunkCovariance, LedoitWolf
#read in the training data
train_path = '<train data path>/train.csv'
reader = csv.reader(open(train_path,"rb"),delimiter=',')
train = list(reader)
X = np.array(train).astype('float')
n_samples = 1000
n_features = 40
n_components = np.arange(0, n_features, 4)
def compute_scores(X):
    pca = PCA()
    pca_scores = []
    for n in n_components:
        pca.n_components = n
        pca_scores.append(np.mean(cross_val_score(pca, X, n_jobs=1)))
    return pca_scores
pca_scores = compute_scores(X)
n_components_pca = n_components[np.argmax(pca_scores)]
OK, I think I found the problem. It does not work with PCA, but it does work with PPCA.
However, when no cv number is provided, cross_val_score automatically uses 3-fold cross-validation,
which created 3 sets with sizes 334, 333 and 333 (my initial training set contains 1000 samples).
Since numpy.mean cannot make a comparison between sets with different sizes (334 vs 333), Python raises an exception.
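For anyone reading this with a newer scikit-learn: in recent releases PCA does have a score method (the average log-likelihood of the samples) and cross_val_score lives in sklearn.model_selection, so the loop from the question can be written as below. This is a sketch on synthetic data, not the original train.csv; the sizes, the candidate range and cv=5 are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

# Synthetic data: 200 samples, 10 features, rank-3 signal plus small noise.
rng = np.random.RandomState(0)
latent = rng.randn(200, 3)
X = latent @ rng.randn(3, 10) + 0.1 * rng.randn(200, 10)

# Score each candidate number of components with an explicit number of folds.
candidates = np.arange(1, 9)
pca_scores = [
    np.mean(cross_val_score(PCA(n_components=n), X, cv=5))
    for n in candidates
]

best_n = candidates[np.argmax(pca_scores)]
print(best_n)
```

cross_val_score returns one score per fold, so np.mean simply averages those fold scores for each candidate.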