PVLIB: How can I add module and inverter specifications which are not present in CEC and SAM library? - pvlib

I am working on a PV system installed in Amsterdam. The PVsystem code is as follows. I am getting good results with the inverter and the modules specified in the code which is obtained with retrieve_sam.
import pvlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pvlib.temperature import TEMPERATURE_MODEL_PARAMETERS
from pandas.plotting import register_matplotlib_converters
from pvlib.modelchain import ModelChain
# Define location for the Netherlands
location = pvlib.location.Location(latitude=52.53, longitude=5.15, tz='UTC', altitude=50, name='amsterdam')
#import the database
module_database = pvlib.pvsystem.retrieve_sam(name='SandiaMod')
inverter_database = pvlib.pvsystem.retrieve_sam(name='cecinverter')
module = module_database.Canadian_Solar_CS5P_220M___2009_
# module = module_database.DMEGC_Solar_320_M6_120BB_ (I want to add this module)
inverter = inverter_database.ABB__PVI_3_0_OUTD_S_US__208V_
temperature_model_parameters = pvlib.temperature.TEMPERATURE_MODEL_PARAMETERS['sapm']['open_rack_glass_glass']
modules_per_string = 10
inverter_per_string = 1
# Define a PV system characteristics
surface_tilt = 12.5
surface_azimuth = 180
system = pvlib.pvsystem.PVSystem(surface_tilt=surface_tilt, surface_azimuth=surface_azimuth, albedo=0.25,
module=module, module_parameters=module,
temperature_model_parameters=temperature_model_parameters,
modules_per_string=modules_per_string, inverter_per_string=inverter_per_string,
inverter=inverter, inverter_parameters=inverter, racking_model='open_rack')
# Define a weather file
def importPSMData():
df = pd.read_csv('/Users/laxmikantradkar/Desktop/PVLIB/solcast_data1.csv', delimiter=';')
# Rename the columns for input to PVLIB
df.rename(columns={'Dhi': 'dhi', 'Dni': 'dni', 'Ghi': 'ghi', 'AirTemp': 'temp_air', 'WindSpeed10m': 'wind_speed',
}, inplace=True)
df.rename(columns={'Year': 'year', 'Month': 'month', 'Day': 'day', 'Hour': 'hour',
'Minute': 'minute'}, inplace=True)
df['dt'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute']])
df.set_index(df['dt'], inplace=True)
# Rename data parameters to run to datetime
# df.rename(columns={'PeriodEnd': 'period_end'}, inplace=True)
# Drop unnecessary columns
df = df.drop('PeriodStart', 1)
df = df.drop('Period', 1)
df = df.drop('Azimuth', 1)
df = df.drop('CloudOpacity', 1)
df = df.drop('DewpointTemp', 1)
df = df.drop('Ebh', 1)
df = df.drop('PrecipitableWater', 1)
df = df.drop('SnowDepth', 1)
df = df.drop('SurfacePressure', 1)
df = df.drop('WindDirection10m', 1)
df = df.drop('Zenith', 1)
return df
mc = ModelChain(system=system, location=location)
weatherData = importPSMData()
mc.run_model(weather=weatherData)
ac_energy = mc.ac
# ac_energy.to_csv('/Users/laxmikantradkar/Desktop/ac_energy_netherlands.csv')
plt.plot(ac_energy)
plt.show()
Now I want to change the module and inverter which is not present in the library. Could anyone please tell me how to do this?
Is it possible to access the library and manually add the row/column of inverter and module? If yes, where is the library located?
Is it ../Desktop/PVLIB/venv/lib/python3.8/site-packages/pvlib/data/sam-library-sandia-modules-2015-6-30.csv
When I change try to change the module/inverter parameters from above path, I receive an error as DataFrame' object has no attribute 'Module name'
I started working on PVLIB_python 2 days ago, so I am new to the language. I really appreciate your help. Feel free to correct me at any point.

I started working on PVLIB_python 2 days ago, so I am new to the
language. I really appreciate your help. Feel free to correct me at
any point.
Welcome to the community! If you haven't already I encourage you to dig through the pvlib-python documentation and continue to learn Python basics through playing with the examples in the documentation. I encourage you to checkout the pandas tutorials and any other highly rated pandas learning material you can find to get yourself running with data science in Python.
When I change try to change the module/inverter parameters from above
path, I receive an error as DataFrame' object has no attribute 'Module
name'
This is because you're asking for a column in the DataFrame table that's not there. No worries, you can make your own module.
Now I want to change the module and inverter which is not present in
the library. Could anyone please tell me how to do this? Is it possible to access the library and manually add the row/column
of inverter and module? If yes, where is the library located?
It isn't necessary to change the library. You can construct a module yourself since it is a Series from the pandas library. Here's an example showing how you can output the module as a dictionary, change a couple parameters and create your own module.
my_new_module = module.copy() # create your own copy of the module
print("Before:", my_new_module, sep="\n") # show module before
my_new_module["Notes"] = "This is how to change a field in the module. Do this for every field in the module."
my_new_module.name = "DMEGC_Solar_320_M6_120BB_" # rename the Series appropriately
print("\nAfter:", my_new_module, sep="\n") # show module after
Then you can just insert "my_new_module" into PVSystem:
system = pvlib.pvsystem.PVSystem(
surface_tilt=surface_tilt,
surface_azimuth=surface_azimuth,
albedo=0.25,
module=my_new_module, # HERE'S THE NEW MODULE!
module_parameters=module,
temperature_model_parameters=temperature_model_parameters,
modules_per_string=modules_per_string,
inverter_per_string=inverter_per_string,
inverter=inverter,
inverter_parameters=inverter,
racking_model='open_rack')
The hard part here is having the right coefficients that you can trust. You may have an easier time using module_database = pvlib.pvsystem.retrieve_sam(name='CECMod') and replacing those parameters since they can be substituted more easily with data from the module spec sheet.
This should work identically for inverters as well.

Related

XGboost Google-AI-Model expecting float values instead of using Categorical values and converting them

I'm trying to run a simple XGBoost Prediction based on Google Cloud using this simple example https://cloud.google.com/ml-engine/docs/scikit/getting-predictions-xgboost#get_online_predictions
The model is building fine, but when I try to run a prediction with a sample input JSON it fails with error "Could not initialize DMatrix from inputs: could not convert string to float:" as shown in the screen below. I understand this is happening because the test-input has strings, I was hoping the Google machine learning model should have information to convert the categorical values to floats. I cannot expect my user to submit-online-prediction-request with float values.
Based on the tutorial it should work without converting the categorical values to floats. Please advise, I have attached the GIF with more details. Thanks
import json
import numpy as np
import os
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
# these are the column labels from the census data files
COLUMNS = (
'age',
'workclass',
'fnlwgt',
'education',
'education-num',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'capital-gain',
'capital-loss',
'hours-per-week',
'native-country',
'income-level'
)
# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
'workclass',
'education',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'native-country'
)
# load training set
with open('./census_data/adult.data', 'r') as train_data:
raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
# remove column we are trying to predict ('income-level') from features list
train_features = raw_training_data.drop('income-level', axis=1)
# create training labels list
train_labels = (raw_training_data['income-level'] == ' >50K')
# load test set
with open('./census_data/adult.test', 'r') as test_data:
raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)
# remove column we are trying to predict ('income-level') from features list
test_features = raw_testing_data.drop('income-level', axis=1)
# create training labels list
test_labels = (raw_testing_data['income-level'] == ' >50K.')
# convert data in categorical columns to numerical values
encoders = {col:LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
train_features[col] = encoders[col].fit_transform(train_features[col])
for col in CATEGORICAL_COLUMNS:
test_features[col] = encoders[col].fit_transform(test_features[col])
# load data into DMatrix object
dtrain = xgb.DMatrix(train_features, train_labels)
dtest = xgb.DMatrix(test_features)
# train XGBoost model
bst = xgb.train({}, dtrain, 20)
bst.save_model('./model.bst')
Here is a fix. Put the input shown in the Google documentation in a file input.json, then run this. The output is input_numerical.json and prediction will succeed if you use that in place of input.json.
This code is just preprocessing categorical columns to numerical forms using the same procedure as was done with training and test data.
import json
import pandas as pd
from sklearn.preprocessing import LabelEncoder
COLUMNS = (
"age",
"workclass",
"fnlwgt",
"education",
"education-num",
"marital-status",
"occupation",
"relationship",
"race",
"sex",
"capital-gain",
"capital-loss",
"hours-per-week",
"native-country",
"income-level",
)
# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
"workclass",
"education",
"marital-status",
"occupation",
"relationship",
"race",
"sex",
"native-country",
)
with open("./input.json", "r") as json_lines:
rows = [json.loads(line) for line in json_lines]
prediction_features = pd.DataFrame(rows, columns=(COLUMNS[:-1]))
encoders = {col: LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
prediction_features[col] = encoders[col].fit_transform(prediction_features[col])
with open("input_numerical.json", "w") as input_numerical:
for index, row in prediction_features.iterrows():
input_numerical.write(row.to_json(orient="values") + "\n")
I created this Google Issues Tracker ticket as the Google documentation is missing this important step.
You can use pandas to convert categorical strings into codes for model inputs. For prediction input you can define a dictionary for each category with corresponding category values and codes. For example, for workclass:
df['workclass_cat'] = df['workclass'].astype('category')
df['workclass_cat'] = df['workclass_cat'].cat.codes
workclass_dict = dict(zip(list(df['workclass'].values), list(df['workclass_cat'].values)))
If a prediction input is 'somestring' you can access its code as follows:
category_input = workclass_dict['somestring']
XGBoost models take floats as input. In your training script you converted the categorical variables into numbers. The same transformation needs to be done when submitting a prediction.

Time variable units "day as %Y%m%d.%f" in python iris

I am hoping someone can help. I am running a few climate models (NetCDF files) in python using iris. All was working well until I added my last model which is formatted differently. The units they use for the time variable in the new models is day as %Y%m%d.%f but in the other models it is days since …. This means that when I try to constrain the time variable I get the following error AttributeError: 'numpy.float64' object has no attribute 'year'.
I tried adding a year variable using iriscc.add_year(EARTH3, 'time') but that just brings up the error ‘Unit has undefined calendar’.
I’m wondering if you know how I might fix this? Do I need to convert the calendar type? Or is there is there a way around that? Not sure how to do that anyway!
Thank you!
Erika
EDIT: here is the full code for my file the model CanESM2 is working, but the model EARTH3 is not - it is the one with the funny time units.
import matplotlib.pyplot as plt
import iris
import iris.coord_categorisation as iriscc
import iris.plot as iplt
import iris.quickplot as qplt
import iris.analysis.cartography
import cf_units
from cf_units import Unit
import datetime
import numpy as np
def main():
#-------------------------------------------------------------------------
#bring in all the GCM models we need and give them a name
CanESM2= '/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/GCM_data/tasmin_Amon_CanESM2_historical_r1i1p1_185001-200512.nc'
EARTH3= '/exports/csce/datastore/geos/users/s0XXXX/Climate_Modelling/GCM_data/tas_Amon_EC-EARTH_historical_r3i1p1_1850-2009.nc'
#Load exactly one cube from given file
CanESM2 = iris.load_cube(CanESM2)
EARTH3 = iris.load_cube(EARTH3)
print"CanESM2 time"
print (CanESM2.coord('time'))
print "EARTH3 time"
print (EARTH3.coord('time'))
#fix EARTH3 time units as they differ from all other models
t_coord=EARTH3.coord('time')
t_unit = t_coord.attributes['invalid_units']
timestep, _, t_fmt_str = t_unit.split(' ')
new_t_unit_str= '{} since 1850-01-01 00:00:00'.format(timestep)
new_t_unit = cf_units.Unit(new_t_unit_str, calendar=cf_units.CALENDAR_STANDARD)
new_datetimes = [datetime.datetime.strptime(str(dt), t_fmt_str) for dt in t_coord.points]
new_dt_points = [new_t_unit.date2num(new_dt) for new_dt in new_datetimes]
new_t_coord = iris.coords.DimCoord(new_dt_points, standard_name='time', units=new_t_unit)
print "EARTH3 new time"
print (EARTH3.coord('time'))
#regrid all models to have same latitude and longitude system, all regridded to model with lowest resolution
CanESM2 = CanESM2.regrid(CanESM2, iris.analysis.Linear())
EARTH3 =EARTH3.regrid(CanESM2, iris.analysis.Linear())
#we are only interested in the latitude and longitude relevant to Malawi (has to be slightly larger than country boundary to take into account resolution of GCMs)
Malawi = iris.Constraint(longitude=lambda v: 32.0 <= v <= 36., latitude=lambda v: -17. <= v <= -8.)
CanESM2 =CanESM2.extract(Malawi)
EARTH3 =EARTH3.extract(Malawi)
#time constraignt to make all series the same, for ERAINT this is 1990-2008 and for RCMs and GCMs this is 1961-2005
iris.FUTURE.cell_datetime_objects = True
t_constraint = iris.Constraint(time=lambda cell: 1961 <= cell.point.year <= 2005)
CanESM2 =CanESM2.extract(t_constraint)
EARTH3 =EARTH3.extract(t_constraint)
#Convert units to match, CORDEX data is in Kelvin but Observed data in Celsius, we would like to show all data in Celsius
CanESM2.convert_units('Celsius')
EARTH3.units = Unit('Celsius') #this fixes EARTH3 which has no units defined
EARTH3=EARTH3-273 #this converts the data manually from Kelvin to Celsius
#add year data to files
iriscc.add_year(CanESM2, 'time')
iriscc.add_year(EARTH3, 'time')
#We are interested in plotting the data by year, so we need to take a mean of all the data by year
CanESM2YR=CanESM2.aggregated_by('year', iris.analysis.MEAN)
EARTH3YR = EARTH3.aggregated_by('year', iris.analysis.MEAN)
#Returns an array of area weights, with the same dimensions as the cube
CanESM2YR_grid_areas = iris.analysis.cartography.area_weights(CanESM2YR)
EARTH3YR_grid_areas = iris.analysis.cartography.area_weights(EARTH3YR)
#We want to plot the mean for the whole region so we need a mean of all the lats and lons
CanESM2YR_mean = CanESM2YR.collapsed(['latitude', 'longitude'], iris.analysis.MEAN, weights=CanESM2YR_grid_areas)
EARTH3YR_mean = EARTH3YR.collapsed(['latitude', 'longitude'], iris.analysis.MEAN, weights=EARTH3YR_grid_areas)
#-------------------------------------------------------------------------
#PART 4: PLOT LINE GRAPH
#limit x axis
plt.xlim((1961,2005))
#assign the line colours and set x axis to 'year' rather than 'time'
qplt.plot(CanESM2YR_mean.coord('year'), CanESM2YR_mean, label='CanESM2', lw=1.5, color='blue')
qplt.plot(EARTH3YR_mean.coord('year'), EARTH3YR_mean, label='EC-EARTH (r3i1p1', lw=1.5, color='magenta')
#set a title for the y axis
plt.ylabel('Near-Surface Temperature (degrees Celsius)')
#create a legend and set its location to under the graph
plt.legend(loc="upper center", bbox_to_anchor=(0.5,-0.05), fancybox=True, shadow=True, ncol=2)
#create a title
plt.title('Tas for Malawi 1961-2005', fontsize=11)
#add grid lines
plt.grid()
#show the graph in the console
iplt.show()
if __name__ == '__main__':
main()
In Iris, unit strings for time coordinates must be specified in the format <time-period> since <epoch>, where <time-period> is a unit of measure of time, such as 'days', or 'years'. This format is specified by udunits2, the library Iris uses to supply valid units and perform unit conversions.
The time coordinate in this case does not have a unit that follows this format, meaning the time coordinate will not have full time coordinate functionality (this partly explains the Attribute Error in the question). To fix this we will need to construct a new time coordinate based on the values and metadata of the existing time coordinate and then replace the cube's existing time coordinate with the new one.
To do this we'll need to:
construct a new time unit based on the metadata contained in the existing time unit
take the existing time coordinate's point values and format them as datetime objects, using the format string specified in the existing time unit
convert the datetime objects from (2.) to an array of floating-point numbers using the new time unit constructed in (1.)
create a new time coordinate from the array constructed in (3.) and the new time unit produced in (1.)
remove the old time coordinate from the cube and add the new one.
Here's the code to do this...
import datetime
import cf_units
import iris
import numpy as np
t_coord = EARTH3.coord('time')
t_unit = t_coord.attributes['invalid_units']
timestep, _, t_fmt_str = t_unit.split(' ')
new_t_unit_str = '{} since 1850-01-01 00:00:00'.format(timestep)
new_t_unit = cf_units.Unit(new_t_unit_str, calendar=cf_units.CALENDAR_STANDARD)
new_datetimes = [datetime.datetime.strptime(str(dt), t_fmt_str) for dt in t_coord.points]
new_dt_points = [new_t_unit.date2num(new_dt) for new_dt in new_datetimes]
new_t_coord = iris.coords.DimCoord(new_dt_points, standard_name='time', units=new_t_unit)
t_coord_dim = cube.coord_dims('time')
cube.remove_coord('time')
cube.add_dim_coord(new_t_coord, t_coord_dim)
I've made an assumption about the best epoch for your time data. I've also made an assumption about the calendar that best describes your data, but you should be able to replace (when constructing new_t_unit) the standard calendar I've chosen with any other valid cf_units calendar without difficulty.
As a final note, it is effectively impossible to change calendar types. This is because different calendar types include and exclude different days. For example, a 360day calendar has a Feb 30 but no May 31 (as it assumes 12 idealised 30 day long months). If you try and convert from a 360day calendar to a standard calendar, problems you hit include what you do with the data from 29 and 30 Feb, and how you fill the five missing days that don't exist in a 360day calendar. For such reasons it's generally impossible to convert calendars (and Iris doesn't allow such operations).
Hope this helps!
Maybe the answer is not more useful however I write here the function that I made in order to convert the data from %Y%m%d.%f in datetime array.
The function create a perfect datetime array, without missing values, it can be modified to take into account if there are missing times, however a climate model should not have missing data.
def fromEARTHtime2Datetime(dt,timeVecEARTH):
"""
This function returns the perfect array from the EARTH %Y%m%d.%f time
format and convert it to a more useful time, such as the time array
from the datetime of pyhton, this is WHTOUT any missing data!
Parameters
----------
dt : string
This is the time discretization, it can be 1h or 6h, but always it
needs to be hours, example dt = '6h'.
timeVecEARTH : array of float
Vector of the time to be converted. For example the time of the
EARTH is day as %Y%m%d.%f.
And only this format can be converted to datetime, for example:
20490128.0,20490128.25,20490128.5,20490128.75 this will be converted
in datetime: '2049-01-28 00:00:00', '2049-01-28 60:00:00',
'2049-01-28 12:00:00','2049-01-28 18:00:00'
Returns
-------
timeArrNew : datetime
This is the perfect and WITHOUT any missing data datatime array,
for example: DatetimeIndex(['2049-01-28 00:00:00', '2049-01-28 06:00:00',
...
'2049-02-28 18:00:00', '2049-03-01 00:00:00'],
dtype='datetime64[ns]', length=129, freq='6H')
"""
dtDay = 24/np.float(dt[:-1])
partOfDay = np.arange(0,1,1/dtDay)
hDay = []
for ip in partOfDay:
hDay.append('%02.f:00:00' %(24*ip))
dictHours = dict(zip(partOfDay,hDay))
t0Str = str(timeVecEARTH[0])
timeAux0 = t0Str.split('.')
timeAux0 = timeAux0[0][0:4] +'-' + timeAux0[0][4:6] +'-' + timeAux0[0][6:] + ' ' + dictHours[float(timeAux0[1])]
tendStr = str(timeVecEARTH[-1])
timeAuxEnd = tendStr.split('.')
timeAuxEnd = timeAuxEnd[0][0:4] +'-' + timeAuxEnd[0][4:6] +'-' + timeAuxEnd[0][6:] + ' ' + dictHours[float(timeAuxEnd[1])]
timeArrNew = pd.date_range(timeAux0,timeAuxEnd, freq=dt)
return timeArrNew

Faster approach to transform a bunch of .csv file into HDF dataframe

1. Background
HDF is a superb file format for data storage and management.
I have a source data (365 .csv files) which containing the air quality data (time resolution 1h) for all monitoring sites (more than 1500) of China. Each file is consisted of many feature (particulate matter, SO2, etc) and its corresponding time.
I have uploaded some template files here for someone interested.
My goal ==> Merge all the files into one dataframe for efficient management
2. My code
# -*- coding: utf-8 -*-
#coding=utf-8
import pandas as pd
from pandas import HDFStore, DataFrame
from pandas import read_hdf
import os,sys,string
import numpy as np
### CREAT A EMPTY HDF5 FILE
hdf = HDFStore("site_2016_whole_year.h5")
### READ THE CSV FILES AND SAVE IT INTO HDF5 FORMAT
os.chdir("./site_2016/")
files = os.listdir("./")
files.sort()
### Read an template file to get the name of columns
test_file= "china_sites_20160101.csv"
test_f = pd.read_csv(test_file,encoding='utf_8')
site_columns = list(test_f.columns[3:])
print site_columns[1]
feature = ['pm25','pm10','O3','O3_8h','CO',"NO2",'SO2',"aqi"]
fe_dict = {"pm25":1,"aqi":0, 'pm10':3, 'SO2':5,'NO2':7, 'O3':9,"O3_8h":11, "CO": 13}
for k in range(0,len(feature),1):
data_2016 = {"date":[],'hour':[],}
for i in range(0,len(site_columns),1):
data_2016[site_columns[i]] = []
for file in files[0:]:
filename,extname = os.path.splitext(file)
if (extname == ".csv"):
datafile =file
f_day = pd.read_csv(datafile,encoding='utf_8')
site_columns = list(f_day.columns[3:])
for i in range(0,len(f_day),15):
datetime = str(f_day["date"].iloc[i])
hour = "%02d" % ((f_day["hour"].iloc[i]))
data_2016["date"].append(datetime)
data_2016["hour"].append(hour)
for t in range(0,len(site_columns),1):
data_2016[site_columns[t]].\
append(f_day[site_columns[t]].iloc[i+fe_dict[feature[k]]])]
data_2016 = pd.DataFrame(data_2016)
hdf.put(feature[k], data_2016, format='table', encoding="utf-8")
3. My problem
Using my code above, the hdf5 file can be created but with slow speed.
My lab has a Linux cluster with 32 core CPU. Is there any method to transform my program into multi-processing ones?
maybe I do not understand your problem properly, but I would use something like this:
import os
import pandas as pd
indir = <'folder with downloaded 12 csv files'>
indata = []
for i in os.listdir(indir):
indata.append(pd.read_csv(indir + i))
out_table = pd.concat(indata)
hdf = pd.HDFStore("site_2016_whole_year.h5", complevel=9, complib='blosc')
hdf.put('table1',out_table)
hdf.close()
for 12 input files it takes 2.5 sec on my laptop, so even for 365 files it should be done in a minute or so. I do not see need for parallelisation in this case.

Reading Time Series from netCDF with python

I'm trying to create time series from a netCDF file (accessed via Thredds server) with python. The code I use seems correct, but the values of the variable amb reading are 'masked'. I'm new into python and I'm not familiar with the formats. Any idea of how can I read the data?
This is the code I use:
import netCDF4
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
from datetime import datetime, timedelta #
dayFile = datetime.now() - timedelta(days=1)
dayFile = dayFile.strftime("%Y%m%d")
url='http://nomads.ncep.noaa.gov:9090/dods/nam/nam%s/nam1hr_00z' %(dayFile)
# NetCDF4-Python can open OPeNDAP dataset just like a local NetCDF file
nc = netCDF4.Dataset(url)
varsInFile = nc.variables.keys()
lat = nc.variables['lat'][:]
lon = nc.variables['lon'][:]
time_var = nc.variables['time']
dtime = netCDF4.num2date(time_var[:],time_var.units)
first = netCDF4.num2date(time_var[0],time_var.units)
last = netCDF4.num2date(time_var[-1],time_var.units)
print first.strftime('%Y-%b-%d %H:%M')
print last.strftime('%Y-%b-%d %H:%M')
# determine what longitude convention is being used
print lon.min(),lon.max()
# Specify desired station time series location
# note we add 360 because of the lon convention in this dataset
#lati = 36.605; loni = -121.85899 + 360. # west of Pacific Grove, CA
lati = 41.4; loni = -100.8 +360.0 # Georges Bank
# Function to find index to nearest point
def near(array,value):
idx=(abs(array-value)).argmin()
return idx
# Find nearest point to desired location (no interpolation)
ix = near(lon, loni)
iy = near(lat, lati)
print ix,iy
# Extract desired times.
# 1. Select -+some days around the current time:
start = netCDF4.num2date(time_var[0],time_var.units)
stop = netCDF4.num2date(time_var[-1],time_var.units)
time_var = nc.variables['time']
datetime = netCDF4.num2date(time_var[:],time_var.units)
istart = netCDF4.date2index(start,time_var,select='nearest')
istop = netCDF4.date2index(stop,time_var,select='nearest')
print istart,istop
# Get all time records of variable [vname] at indices [iy,ix]
vname = 'dswrfsfc'
var = nc.variables[vname]
hs = var[istart:istop,iy,ix]
tim = dtime[istart:istop]
# Create Pandas time series object
ts = pd.Series(hs,index=tim,name=vname)
The var data are not read as I expected, apparently because data is masked:
>>> hs
masked_array(data = [-- -- -- ..., -- -- --],
mask = [ True True True ..., True True True],
fill_value = 9.999e+20)
The var name, and the time series are correct, as well of the rest of the script. The only thing that doesn't work is the var data retrieved. This is the time serie I get:
>>> ts
2016-10-25 00:00:00.000000 NaN
2016-10-25 01:00:00.000000 NaN
2016-10-25 02:00:00.000006 NaN
2016-10-25 03:00:00.000000 NaN
2016-10-25 04:00:00.000000 NaN
... ... ... ... ...
2016-10-26 10:00:00.000000 NaN
2016-10-26 11:00:00.000006 NaN
Name: dswrfsfc, dtype: float32
Any help will be appreciated!
Hmm, this code looks familiar. ;-)
You are getting NaNs because the NAM model you are trying to access now uses longitude in the range [-180, 180] instead of the range [0, 360]. So if you request loni = -100.8 instead of loni = -100.8 +360.0, I believe your code will return non-NaN values.
It's worth noting, however, that the task of extracting time series from multidimensional gridded data is now much easier with xarray, because you can simply select a dataset closest to a lon,lat point and then plot any variable. The data only gets loaded when you need it, not when you extract the dataset object. So basically you now only need:
import xarray as xr
ds = xr.open_dataset(url) # NetCDF or OPeNDAP URL
lati = 41.4; loni = -100.8 # Georges Bank
# Extract a dataset closest to specified point
dsloc = ds.sel(lon=loni, lat=lati, method='nearest')
# select a variable to plot
dsloc['dswrfsfc'].plot()
Full notebook here: http://nbviewer.jupyter.org/gist/rsignell-usgs/d55b37c6253f27c53ef0731b610b81b4
I checked your approach with xarray. Works great to extract Solar radiation data! I can add that the first point is not defined (NaN) because the model starts calculating there, so there is no accumulated radiation data (to calculate hourly global radiation). So that is why it is masked.
Something everyone overlooked is that the output is not correct. It does look ok (at noon= sunshine, at nmidnight=0, dark), but the daylength is not correct! I checked it for 52 latitude north and 5.6 longitude (east) (November) and daylength is at least 2 hours too much! (The NOAA Panoply viewer for Netcdf databases gives similar results)

Loading files to perform Kmean using sklearn

I have 100 files that contain system call traces. Each files is presented as seen below:
setpgrp ioctl setpgrp ioctl ioctl ....
I am trying to load these files and perform kmean calculation on them to cluster them based on similarities. Based on a tutorial on the sklearn webpage I written the following:
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.datasets import load_files
from sklearn.cluster import KMeans, MiniBatchKMeans
import numpy as np
# parse commandline arguments
op = OptionParser()
op.add_option("--lsa",
dest="n_components", type="int",
help="Preprocess documents with latent semantic analysis.")
op.add_option("--no-minibatch",
action="store_false", dest="minibatch", default=True,
help="Use ordinary k-means algorithm (in batch mode).")
op.add_option("--use-idf",
action="store_false", dest="use_idf", default=True,
help="Disable Inverse Document Frequency feature weighting.")
op.add_option("--n-features", type=int, default=10000,
help="Maximum number of features (dimensions)"
" to extract from text.")
op.add_option("--verbose",
action="store_true", dest="verbose", default=False,
help="Print progress reports inside k-means algorithm.")
print(__doc__)
op.print_help()
(opts, args) = op.parse_args()
if len(args) > 0:
op.error("this script takes no arguments.")
sys.exit(1)
print("Loading training data:")
trainingdata = load_files('C:\data\Training data')
print("%d documents" % len(trainingdata.data))
print()
print("Extracting features from the training trainingdata using a sparse vectorizer")
if opts.use_idf:
vectorizer = TfidfVectorizer(input="file",min_df=1)
X = vectorizer.fit_transform(trainingdata.data)
print("n_samples: %d, n_features: %d" % X.shape)
print()
if opts.n_components:
print("Performing dimensionality reduction using LSA")
# Vectorizer results are normalized, which makes KMeans behave as
# spherical k-means for better results. Since LSA/SVD results are
# not normalized, we have to redo the normalization.
svd = TruncatedSVD(opts.n_components)
lsa = make_pipeline(svd, Normalizer(copy=False))
X = lsa.fit_transform(X)
explained_variance = svd.explained_variance_ratio_.sum()
print("Explained variance of the SVD step: {}%".format(
int(explained_variance * 100)))
print()
However it seems that none of the files in the dataset directory get loaded into the memory when though all files are available. I get the following error when executing the program:
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
Can anyone tell me why the dataset is not being loaded? What am I doing wrong?
I finally managed to load the files. The approach to use Kmean in sklearn is to vectorize the training data (using tfidf or count_vectorizer), then transform your test data using the vectorization of your training data. Once that is done you can initialize the Kmean parameters, use the training data set vectors to create the kmean cluster. Finally you can cluster your test data around your training data centroid.
The following code does what is explained above.
#Read the data in a directory:
def readfile(dataDir):
data_set = []
for file in os.listdir(dataDir):
trainingfiles = os.path.join(dataDir, file)
if os.path.isfile(trainingfiles):
data = open(trainingfiles, 'r')
dataread=str.decode(data.read())
data_set.append(dataread)
return data_set
#fitting tfidf transfrom for training data
tfidf_vectorizer_trainingset = tfidf_vectorizer.fit_transform(readfile(trainingdataDir)).toarray()
#transform the test set based on the training set
tfidf_vectorizer_testset = tfidf_vectorizer.transform(readfile(testingdataDir)).toarray()
# Kmean Clustering parameters
kmean_parameters = KMeans(n_clusters=number_of_clusters, init='k-means++', max_iter=100, n_init=1)
#Cluster the training data based on the parameters
KmeanAnalysis_training = kmean_parameters.fit(tfidf_vectorizer_trainingset)
#transform the test data based on the clustering of the training data
KmeanAnalysis_test = kmean_parameters.transform(tfidf_vectorizer_testset)