PVLIB - DC Power From Irradiation - Simple Calculation - pvlib

Dear pvlib users and devels.
I'm a researcher in computer science, not particularly expert in the simulation or modelling of solar panels. I'm interested in use pvlib since
we are trying to simulate the works of a small solar panel used for IoT
applications, in particular the panel spec are the following:
12.8% max efficiency, Vmp = 5.82V, size = 225 × 155 × 17 mm.
Before using pvlib, one of my collaborator wrote a code that compute the
irradiation directly from average monthly values calculated with PVWatt.
I was not really satisfied, so we are starting to use pvlib.
In the old code, we have the power and current of the panel calculated as:
W = Irradiation * PanelSize(m^2) * Efficiency
A = W / Vmp
The Irradiation, in Madrid, as been obtained with PVWatt, and this is
what my collaborator used:
DIrradiance = (2030.0,2960.0,4290.0,5110.0,5950.0,7090.0,7200.0,6340.0,4870.0,3130.0,2130.0,1700.0)
I'm trying to understand if pvlib compute values similar to the ones above, as averages over a day for each month. And the curve of production in day.
I wrote this to compare pvlib with our old model:
import math
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import pandas as pd
import pvlib
from pvlib.location import Location
def irradiance(day,m):
DIrradiance =(2030.0,2960.0,4290.0,5110.0,5950.0,
7090.0,7200.0,6340.0,4870.0,3130.0,2130.0,1700.0)
madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start=dt.datetime(2015,m,day,00,00),
end=dt.datetime(2015,m,day,23,59),
freq='60min')
spaout = pvlib.solarposition.spa_python(times, madrid.latitude, madrid.longitude)
spaout = spaout.assign(cosz=pd.Series(np.cos(np.deg2rad(spaout['zenith']))))
z = np.array(spaout['cosz'])
return z.clip(0)*(DIrradiance[m-1])
madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start = dt.datetime(2015,8,15,00,00),
end = dt.datetime(2015,8,15,23,59),
freq='60min')
old = irradiance(15,8) # old model
new = madrid.get_clearsky(times) # pvlib irradiance
plt.plot(old,'r-') # compare them.
plt.plot(old/6.0,'y-') # old seems 6 times more..I do not know why
plt.plot(new['ghi'].values,'b-')
plt.show()
The code above compute the old irradiance, using the zenit angle. and compute the ghi values using the clear_sky. I do not understand if the values in ghi must be multiplied by the cos of zenit too, or not. Anyway
they are smaller by a factor of 6. What I'd like to have at the end is the
power and current in output from the panel (DC) without any inverter, and
we are not really interested at modelling it exactly, but at least, to
have a reasonable curve. We are able to capture from the panel the ampere
produced, and we want to compare the values from the measurements putting
the panel on the roof top with the values calculated by pvlib.
Any help on this would be really appreachiated. Thanks
Sorry Will I do not care a lot about my previous model since I'd like to move all code to pvlib. I followed your suggestion and I'm using irradiance.total_irrad, the code now looks in this way:
madrid = Location(40.42, -3.70, 'Europe/Madrid', 600, 'Madrid')
times = pd.date_range(start=dt.datetime(2015,1,1,00,00),
end=dt.datetime(2015,1,1,23,59),
freq='60min')
ephem_data = pvlib.solarposition.spa_python(times, madrid.latitude,
madrid.longitude)
irrad_data = madrid.get_clearsky(times)
AM = atmosphere.relativeairmass(ephem_data['apparent_zenith'])
total = irradiance.total_irrad(40, 180,
ephem_data['apparent_zenith'], ephem_data['azimuth'],
dni=irrad_data['dni'], ghi=irrad_data['ghi'],
dhi=irrad_data['dhi'], airmass=AM,
surface_type='urban')
poa = total['poa_global'].values
Now, I know the irradiance on POA, and I want to compute the output in Ampere: It is just
(poa*PANEL_EFFICIENCY*AREA) / VOLT_OUTPUT ?

It's not clear to me how you arrived at your values for DIrradiance or what the units are, so I can't comment much the discrepancies between the values. I'm guessing that it's some kind of monthly data since there are 12 values. If so, you'd need to calculate ~hourly pvlib irradiance data and then integrate it to check for consistency.
If your module will be tilted, you'll need to convert your ~hourly irradiance GHI, DNI, DHI values to plane of array (POA) irradiance using a transposition model. The irradiance.total_irrad function is the easiest way to do that.
The next steps depend on the IV characteristics of your module, the rest of the circuit, and how accurate you need the model to be.

Related

Pvlib / Bird1984: North-facing element shows negative Irradiance

When using pvlib (but also the spectrl2 implementation provided by NREL), I obtain negative Irradiance for a north-facing panel.
Is this expected behaviour? Should the spectrum simply be cut at zero?
Added example code based on the tutorial below:
## Using PV Lib
from pvlib import spectrum, solarposition, irradiance, atmosphere
import pandas as pd
import matplotlib.pyplot as plt
# assumptions from the technical report:
lat = 49.88
lon = 8.63
tilt = 45
azimuth = 0 # North = 0
pressure = 101300 # sea level, roughly
water_vapor_content = 0.5 # cm
tau500 = 0.1
ozone = 0.31 # atm-cm
albedo = 0.2
times = pd.date_range('2021-11-30 8:00', freq='h', periods=6, tz="Europe/Berlin") # , tz='Etc/GMT+9'
solpos = solarposition.get_solarposition(times, lat, lon)
aoi = irradiance.aoi(tilt, azimuth, solpos.apparent_zenith, solpos.azimuth)
# The technical report uses the 'kasten1966' airmass model, but later
# versions of SPECTRL2 use 'kastenyoung1989'. Here we use 'kasten1966'
# for consistency with the technical report.
relative_airmass = atmosphere.get_relative_airmass(solpos.apparent_zenith,
model='kasten1966')
spectra = spectrum.spectrl2(
apparent_zenith=solpos.apparent_zenith,
aoi=aoi,
surface_tilt=tilt,
ground_albedo=albedo,
surface_pressure=pressure,
relative_airmass=relative_airmass,
precipitable_water=water_vapor_content,
ozone=ozone,
aerosol_turbidity_500nm=tau500,
)
plt.figure()
plt.plot(spectra['wavelength'], spectra['poa_global'])
plt.xlim(200, 2700)
# plt.ylim(0, 1.8)
plt.title(r"2021-11-30, Darmstadt, $\tau=0.1$, Wv=0.5 cm")
plt.ylabel(r"Irradiance ($W m^{-2} nm^{-1}$)")
plt.xlabel(r"Wavelength ($nm$)")
time_labels = times.strftime("%H:%M %p")
labels = [
"AM {:0.02f}, Z{:0.02f}, {}".format(*vals)
for vals in zip(relative_airmass, solpos.apparent_zenith, time_labels)
]
plt.legend(labels)
plt.show()
No, this is not expected behavior. I suspect the issue is caused by improper handling of angle-of-incidence values greater than 90 degrees, and essentially the same problem (for a different function) discussed here: https://github.com/pvlib/pvlib-python/issues/526
It's unfortunate that the reference implementation from NREL has the problem too (perhaps when the model was originally designed, nobody could conceive of a panel facing away from the sun!), but I think the pvlib implementation should be fixed regardless. I encourage you to file a bug report here: https://github.com/pvlib/pvlib-python/issues
In the meantime, I think you can resolve the issue in your own code by adding a line like aoi[aoi > 90] = 90 prior to passing it to spectrum.spectrl2, although be careful about this if you end up using aoi for other purposes later in the script. I would be interested to hear if the resulting spectra are consistent with your expectations.
Edit for posterity: a github issue has been opened here: https://github.com/pvlib/pvlib-python/issues/1348

Time variable units "day as %Y%m%d.%f" in python iris

I am hoping someone can help. I am running a few climate models (NetCDF files) in python using iris. All was working well until I added my last model which is formatted differently. The units they use for the time variable in the new models is day as %Y%m%d.%f but in the other models it is days since …. This means that when I try to constrain the time variable I get the following error AttributeError: 'numpy.float64' object has no attribute 'year'.
I tried adding a year variable using iriscc.add_year(EARTH3, 'time') but that just brings up the error ‘Unit has undefined calendar’.
I’m wondering if you know how I might fix this? Do I need to convert the calendar type? Or is there is there a way around that? Not sure how to do that anyway!
Thank you!
Erika
EDIT: here is the full code for my file the model CanESM2 is working, but the model EARTH3 is not - it is the one with the funny time units.
import matplotlib.pyplot as plt
import iris
import iris.coord_categorisation as iriscc
import iris.plot as iplt
import iris.quickplot as qplt
import iris.analysis.cartography
import cf_units
from cf_units import Unit
import datetime
import numpy as np
def main():
#-------------------------------------------------------------------------
#bring in all the GCM models we need and give them a name
CanESM2= '/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/GCM_data/tasmin_Amon_CanESM2_historical_r1i1p1_185001-200512.nc'
EARTH3= '/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/GCM_data/tas_Amon_EC-EARTH_historical_r3i1p1_1850-2009.nc'
#Load exactly one cube from given file
CanESM2 = iris.load_cube(CanESM2)
EARTH3 = iris.load_cube(EARTH3)
print"CanESM2 time"
print (CanESM2.coord('time'))
print "EARTH3 time"
print (EARTH3.coord('time'))
#fix EARTH3 time units as they differ from all other models
t_coord=EARTH3.coord('time')
t_unit = t_coord.attributes['invalid_units']
timestep, _, t_fmt_str = t_unit.split(' ')
new_t_unit_str= '{} since 1850-01-01 00:00:00'.format(timestep)
new_t_unit = cf_units.Unit(new_t_unit_str, calendar=cf_units.CALENDAR_STANDARD)
new_datetimes = [datetime.datetime.strptime(str(dt), t_fmt_str) for dt in t_coord.points]
new_dt_points = [new_t_unit.date2num(new_dt) for new_dt in new_datetimes]
new_t_coord = iris.coords.DimCoord(new_dt_points, standard_name='time', units=new_t_unit)
print "EARTH3 new time"
print (EARTH3.coord('time'))
#regrid all models to have same latitude and longitude system, all regridded to model with lowest resolution
CanESM2 = CanESM2.regrid(CanESM2, iris.analysis.Linear())
EARTH3 =EARTH3.regrid(CanESM2, iris.analysis.Linear())
#we are only interested in the latitude and longitude relevant to Malawi (has to be slightly larger than country boundary to take into account resolution of GCMs)
Malawi = iris.Constraint(longitude=lambda v: 32.0 <= v <= 36., latitude=lambda v: -17. <= v <= -8.)
CanESM2 =CanESM2.extract(Malawi)
EARTH3 =EARTH3.extract(Malawi)
#time constraignt to make all series the same, for ERAINT this is 1990-2008 and for RCMs and GCMs this is 1961-2005
iris.FUTURE.cell_datetime_objects = True
t_constraint = iris.Constraint(time=lambda cell: 1961 <= cell.point.year <= 2005)
CanESM2 =CanESM2.extract(t_constraint)
EARTH3 =EARTH3.extract(t_constraint)
#Convert units to match, CORDEX data is in Kelvin but Observed data in Celsius, we would like to show all data in Celsius
CanESM2.convert_units('Celsius')
EARTH3.units = Unit('Celsius') #this fixes EARTH3 which has no units defined
EARTH3=EARTH3-273 #this converts the data manually from Kelvin to Celsius
#add year data to files
iriscc.add_year(CanESM2, 'time')
iriscc.add_year(EARTH3, 'time')
#We are interested in plotting the data by year, so we need to take a mean of all the data by year
CanESM2YR=CanESM2.aggregated_by('year', iris.analysis.MEAN)
EARTH3YR = EARTH3.aggregated_by('year', iris.analysis.MEAN)
#Returns an array of area weights, with the same dimensions as the cube
CanESM2YR_grid_areas = iris.analysis.cartography.area_weights(CanESM2YR)
EARTH3YR_grid_areas = iris.analysis.cartography.area_weights(EARTH3YR)
#We want to plot the mean for the whole region so we need a mean of all the lats and lons
CanESM2YR_mean = CanESM2YR.collapsed(['latitude', 'longitude'], iris.analysis.MEAN, weights=CanESM2YR_grid_areas)
EARTH3YR_mean = EARTH3YR.collapsed(['latitude', 'longitude'], iris.analysis.MEAN, weights=EARTH3YR_grid_areas)
#-------------------------------------------------------------------------
#PART 4: PLOT LINE GRAPH
#limit x axis
plt.xlim((1961,2005))
#assign the line colours and set x axis to 'year' rather than 'time'
qplt.plot(CanESM2YR_mean.coord('year'), CanESM2YR_mean, label='CanESM2', lw=1.5, color='blue')
qplt.plot(EARTH3YR_mean.coord('year'), EARTH3YR_mean, label='EC-EARTH (r3i1p1', lw=1.5, color='magenta')
#set a title for the y axis
plt.ylabel('Near-Surface Temperature (degrees Celsius)')
#create a legend and set its location to under the graph
plt.legend(loc="upper center", bbox_to_anchor=(0.5,-0.05), fancybox=True, shadow=True, ncol=2)
#create a title
plt.title('Tas for Malawi 1961-2005', fontsize=11)
#add grid lines
plt.grid()
#show the graph in the console
iplt.show()
if __name__ == '__main__':
main()
In Iris, unit strings for time coordinates must be specified in the format <time-period> since <epoch>, where <time-period> is a unit of measure of time, such as 'days', or 'years'. This format is specified by udunits2, the library Iris uses to supply valid units and perform unit conversions.
The time coordinate in this case does not have a unit that follows this format, meaning the time coordinate will not have full time coordinate functionality (this partly explains the Attribute Error in the question). To fix this we will need to construct a new time coordinate based on the values and metadata of the existing time coordinate and then replace the cube's existing time coordinate with the new one.
To do this we'll need to:
construct a new time unit based on the metadata contained in the existing time unit
take the existing time coordinate's point values and format them as datetime objects, using the format string specified in the existing time unit
convert the datetime objects from (2.) to an array of floating-point numbers using the new time unit constructed in (1.)
create a new time coordinate from the array constructed in (3.) and the new time unit produced in (1.)
remove the old time coordinate from the cube and add the new one.
Here's the code to do this...
import datetime
import cf_units
import iris
import numpy as np
t_coord = EARTH3.coord('time')
t_unit = t_coord.attributes['invalid_units']
timestep, _, t_fmt_str = t_unit.split(' ')
new_t_unit_str = '{} since 1850-01-01 00:00:00'.format(timestep)
new_t_unit = cf_units.Unit(new_t_unit_str, calendar=cf_units.CALENDAR_STANDARD)
new_datetimes = [datetime.datetime.strptime(str(dt), t_fmt_str) for dt in t_coord.points]
new_dt_points = [new_t_unit.date2num(new_dt) for new_dt in new_datetimes]
new_t_coord = iris.coords.DimCoord(new_dt_points, standard_name='time', units=new_t_unit)
t_coord_dim = cube.coord_dims('time')
cube.remove_coord('time')
cube.add_dim_coord(new_t_coord, t_coord_dim)
I've made an assumption about the best epoch for your time data. I've also made an assumption about the calendar that best describes your data, but you should be able to replace (when constructing new_t_unit) the standard calendar I've chosen with any other valid cf_units calendar without difficulty.
As a final note, it is effectively impossible to change calendar types. This is because different calendar types include and exclude different days. For example, a 360day calendar has a Feb 30 but no May 31 (as it assumes 12 idealised 30 day long months). If you try and convert from a 360day calendar to a standard calendar, problems you hit include what you do with the data from 29 and 30 Feb, and how you fill the five missing days that don't exist in a 360day calendar. For such reasons it's generally impossible to convert calendars (and Iris doesn't allow such operations).
Hope this helps!
Maybe the answer is not more useful however I write here the function that I made in order to convert the data from %Y%m%d.%f in datetime array.
The function create a perfect datetime array, without missing values, it can be modified to take into account if there are missing times, however a climate model should not have missing data.
def fromEARTHtime2Datetime(dt,timeVecEARTH):
"""
This function returns the perfect array from the EARTH %Y%m%d.%f time
format and convert it to a more useful time, such as the time array
from the datetime of pyhton, this is WHTOUT any missing data!
Parameters
----------
dt : string
This is the time discretization, it can be 1h or 6h, but always it
needs to be hours, example dt = '6h'.
timeVecEARTH : array of float
Vector of the time to be converted. For example the time of the
EARTH is day as %Y%m%d.%f.
And only this format can be converted to datetime, for example:
20490128.0,20490128.25,20490128.5,20490128.75 this will be converted
in datetime: '2049-01-28 00:00:00', '2049-01-28 60:00:00',
'2049-01-28 12:00:00','2049-01-28 18:00:00'
Returns
-------
timeArrNew : datetime
This is the perfect and WITHOUT any missing data datatime array,
for example: DatetimeIndex(['2049-01-28 00:00:00', '2049-01-28 06:00:00',
...
'2049-02-28 18:00:00', '2049-03-01 00:00:00'],
dtype='datetime64[ns]', length=129, freq='6H')
"""
dtDay = 24/np.float(dt[:-1])
partOfDay = np.arange(0,1,1/dtDay)
hDay = []
for ip in partOfDay:
hDay.append('%02.f:00:00' %(24*ip))
dictHours = dict(zip(partOfDay,hDay))
t0Str = str(timeVecEARTH[0])
timeAux0 = t0Str.split('.')
timeAux0 = timeAux0[0][0:4] +'-' + timeAux0[0][4:6] +'-' + timeAux0[0][6:] + ' ' + dictHours[float(timeAux0[1])]
tendStr = str(timeVecEARTH[-1])
timeAuxEnd = tendStr.split('.')
timeAuxEnd = timeAuxEnd[0][0:4] +'-' + timeAuxEnd[0][4:6] +'-' + timeAuxEnd[0][6:] + ' ' + dictHours[float(timeAuxEnd[1])]
timeArrNew = pd.date_range(timeAux0,timeAuxEnd, freq=dt)
return timeArrNew

Parseval's Theorem does not hold for FFT of a sinusoid + noise?

Thanks in advance for any help on this subject. I've recently been trying to work out Parseval's theorem for discrete fourier transforms when noise is included. I based my code from this code.
What I expected to see is that (as when no noise is included) the total power in the frequency domain is half that of the total power in the time-domain, as I have cut off the negative frequencies.
However, as more noise is added to the time-domain signal, the total power of the fourier transform of the signal+noise becomes much less than half of the total power of the signal+noise.
My code is as follows:
import numpy as np
import numpy.fft as nf
import matplotlib.pyplot as plt
def findingdifference(randomvalues):
n = int(1e7) #number of points
tmax = 40e-3 #measurement time
f1 = 30e6 #beat frequency
t = np.linspace(-tmax,tmax,num=n) #define time axis
dt = t[1]-t[0] #time spacing
gt = np.sin(2*np.pi*f1*t)+randomvalues #make a sin + noise
fftfreq = nf.fftfreq(n,dt) #defining frequency (x) axis
hkk = nf.fft(gt) # fourier transform of sinusoid + noise
hkn = nf.fft(randomvalues) #fourier transform of just noise
fftfreq = fftfreq[fftfreq>0] #only taking positive frequencies
hkk = hkk[fftfreq>0]
hkn = hkn[fftfreq>0]
timedomain_p = sum(abs(gt)**2.0)*dt #parseval's theorem for time
freqdomain_p = sum(abs(hkk)**2.0)*dt/n # parseval's therom for frequency
difference = (timedomain_p-freqdomain_p)/timedomain_p*100 #percentage diff
tdomain_pn = sum(abs(randomvalues)**2.0)*dt #parseval's for time
fdomain_pn = sum(abs(hkn)**2.0)*dt/n # parseval's for frequency
difference_n = (tdomain_pn-fdomain_pn)/tdomain_pn*100 #percent diff
return difference,difference_n
def definingvalues(max_amp,length):
noise_amplitude = np.linspace(0,max_amp,length) #defining noise amplitude
difference = np.zeros((2,len(noise_amplitude)))
randomvals = np.random.random(int(1e7)) #defining noise
for i in range(len(noise_amplitude)):
difference[:,i] = (findingdifference(noise_amplitude[i]*randomvals))
return noise_amplitude,difference
def figure(max_amp,length):
noise_amplitude,difference = definingvalues(max_amp,length)
plt.figure()
plt.plot(noise_amplitude,difference[0,:],color='red')
plt.plot(noise_amplitude,difference[1,:],color='blue')
plt.xlabel('Noise_Variable')
plt.ylabel(r'Difference in $\%$')
plt.show()
return
figure(max_amp=3,length=21)
My final graph looks like this figure. Am I doing something wrong when working this out? Is there an physical reason that this trend occurs with added noise? Is it to do with doing a fourier transform on a not perfectly sinusoidal signal? The reason I am doing this is to understand a very noisy sinusoidal signal that I have real data for.
Parseval's theorem holds in general if you use the whole spectrum (positive and negative) frequencies to compute the power.
The reason for the discrepancy is the DC (f=0) component, which is treated somewhat special.
First, where does the DC component come from? You use np.random.random to generate random values between 0 and 1. So on average you raise the signal by 0.5*noise_amplitude, which entails a lot of power. This power is correctly computed in the time domain.
However, in the frequency domain, there is only a single FFT bin that corresponds to f=0. The power of all other frequencies is distributed over two bins, only the DC power is contained in a single bin.
By scaling the noise you add DC power. By removing the negative frequencies you remove half the signal power, but most of the noise power is located in the DC component which is used fully.
You have several options:
Use all frequencies to compute the power.
Use noise without a DC component: randomvals = np.random.random(int(1e7)) - 0.5
"Fix" the power calculation by removing half of the DC power: hkk[fftfreq==0] /= np.sqrt(2)
I'd go with option 1. The second might be OK and I don't really recommend 3.
Finally, there is a minor problem with the code:
fftfreq = fftfreq[fftfreq>0] #only taking positive frequencies
hkk = hkk[fftfreq>0]
hkn = hkn[fftfreq>0]
This does not really make sense. Better change it to
hkk = hkk[fftfreq>=0]
hkn = hkn[fftfreq>=0]
or completely remove it for option 1.

Remove weekends in finance plots with volume overlay [duplicate]

I've been having some difficulty with Matplotlib's finance charting. It seems like their candlestick charts work best with daily data, and I am having a hard time making them work with intraday (every 5 minutes, between 9:30 and 4 pm) data.
I have pasted sample data in pastebin. The top is what I get from the database, and the bottom is tupled with the date formatted into an ordinal float for use in Matplotlib.
Link to sample data
When I draw my charts there are huge gaps in it, the axes suck, and the zoom is equally horrible. http://imgur.com/y7O8A
How do I make a nice readable graph out of this data? My ultimate goal is to get a chart that looks remotely like this:
http://i.imgur.com/EnrTW.jpg
The data points can be in various increments from 5 minutes to 30 minutes.
I have also made a Pandas dataframe of the data, but I am not sure if pandas has candlestick functionality.
If I understand well, one of your major concern is the gaps between the daily data.
To get rid of them, one method is to artificially 'evenly space' your data (but of course you will loose any temporal indication intra-day).
Anyways, doing this way, you will be able to obtain a chart that looks like the one you have proposed as an example.
The commented code and the resulting graph are below.
import numpy as np
import matplotlib.pyplot as plt
import datetime
from matplotlib.finance import candlestick
from matplotlib.dates import num2date
# data in a text file, 5 columns: time, opening, close, high, low
# note that I'm using the time you formated into an ordinal float
data = np.loadtxt('finance-data.txt', delimiter=',')
# determine number of days and create a list of those days
ndays = np.unique(np.trunc(data[:,0]), return_index=True)
xdays = []
for n in np.arange(len(ndays[0])):
xdays.append(datetime.date.isoformat(num2date(data[ndays[1],0][n])))
# creation of new data by replacing the time array with equally spaced values.
# this will allow to remove the gap between the days, when plotting the data
data2 = np.hstack([np.arange(data[:,0].size)[:, np.newaxis], data[:,1:]])
# plot the data
fig = plt.figure(figsize=(10, 5))
ax = fig.add_axes([0.1, 0.2, 0.85, 0.7])
# customization of the axis
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.tick_params(axis='both', direction='out', width=2, length=8,
labelsize=12, pad=8)
ax.spines['left'].set_linewidth(2)
ax.spines['bottom'].set_linewidth(2)
# set the ticks of the x axis only when starting a new day
ax.set_xticks(data2[ndays[1],0])
ax.set_xticklabels(xdays, rotation=45, horizontalalignment='right')
ax.set_ylabel('Quote ($)', size=20)
ax.set_ylim([177, 196])
candlestick(ax, data2, width=0.5, colorup='g', colordown='r')
plt.show()
I got tired of matplotlib's (and plotly's) bad performance and lack of such features you request, so implemented one of my own. Here's how that works:
import finplot as fplt
import yfinance
df = yfinance.download('AAPL')
fplt.candlestick_ochl(df[['Open', 'Close', 'High', 'Low']])
fplt.show()
Not only are days in which the exchange is closed left out automatically, but also has better performance and a nicer api. For something that more resembles what you're ultimately looking for:
import finplot as fplt
import yfinance
symbol = 'AAPL'
df = yfinance.download(symbol)
ax = fplt.create_plot(symbol)
fplt.candlestick_ochl(df[['Open', 'Close', 'High', 'Low']], ax=ax)
fplt.plot(df['Close'].rolling(200).mean(), ax=ax, legend='SMA 200')
fplt.plot(df['Close'].rolling(50).mean(), ax=ax, legend='SMA 50')
fplt.plot(df['Close'].rolling(20).mean(), ax=ax, legend='SMA 20')
fplt.volume_ocv(df[['Open', 'Close', 'Volume']], ax=ax.overlay())
fplt.show()

NLTK package to estimate the (unigram) perplexity

I am trying to calculate the perplexity for the data I have. The code I am using is:
import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm
But I am receiving the error,
File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'
I have already performed Latent Dirichlet Allocation for the data I have and I have generated the unigrams and their respective probabilities (they are normalized as the sum of total probabilities of the data is 1).
My unigrams and their probability looks like:
Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781
This is just a fragment of the unigrams file I have. The same format is followed for about 1000s of lines. The total probabilities (second column) summed gives 1.
I am a budding programmer. This ngram.py belongs to the nltk package and I am confused as to how to rectify this. The sample code I have here is from the nltk documentation and I don't know what to do now. Please help on what I can do. Thanks in advance!
Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:
Now you say you have already constructed the unigram model, meaning, for each word you have the relevant probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that would provide the probability of each word in the corpus. You also need to have a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I could adapt it to my solution accordingly.
perplexity = 1
N = 0
for word in testset:
if word in unigram:
N += 1
perplexity = perplexity * (1/unigram[word])
perplexity = pow(perplexity, 1/float(N))
UPDATE:
As you asked for a complete working example, here's a very simple one.
Suppose this is our corpus:
corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""
Here's how we construct the unigram model first:
import collections, nltk
# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)
#here you construct the unigram language model
def unigram(tokens):
model = collections.defaultdict(lambda: 0.01)
for f in tokens:
try:
model[f] += 1
except KeyError:
model [f] = 1
continue
N = float(sum(model.values()))
for word in model:
model[word] = model[word]/N
return model
Our model here is smoothed. For words outside the scope of its knowledge, it assigns a low probability of 0.01. I already told you how to compute perplexity:
#computes perplexity of the unigram model on a testset
def perplexity(testset, model):
testset = testset.split()
perplexity = 1
N = 0
for word in testset:
N += 1
perplexity = perplexity * (1/model[word])
perplexity = pow(perplexity, 1/float(N))
return perplexity
Now we can test this on two different test sets:
testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"
model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)
for which you get the following result:
>>>
49.09452736318415
99.99999999999997
Note that when dealing with perplexity, we try to reduce it. A language model that has less perplexity with regards to a certain test set is more desirable than one with a bigger perplexity. In the first test set, the word Monty was included in the unigram model, so the respective number for perplexity was also smaller.
Thanks for the code snippet! Shouldn't:
for word in model:
model[word] = model[word]/float(sum(model.values()))
be rather:
v = float(sum(model.values()))
for word in model:
model[word] = model[word]/v
Oh ... I see was already answered ...