How can I compute the Mean Absolute Error (MAE) with CDO?

As a new student, I could not find out how to calculate the Mean Absolute Error (MAE) with CDO. Can you explain how to compute it?

If you have your observations (or analysis) in obs.nc and your model output in model.nc, then you can calculate the MAE in the following way:
# calculate the difference
cdo sub model.nc obs.nc diff.nc
# absolute value
cdo abs diff.nc abs.nc
# time mean
cdo timmean abs.nc MAE.nc
Or, piping it all into one line:
cdo timmean -abs -sub model.nc obs.nc MAE.nc
If instead of the temporal mean you need the spatial MAE, then of course you replace timmean with fldmean, as in the example below.
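For example, the spatial variant as a single piped command (the output filename is arbitrary); this gives one MAE value per timestep, averaged over the grid, instead of a map of time-mean MAE:
cdo fldmean -abs -sub model.nc obs.nc MAE_spatial.nc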

Related

Extracting SST time series at multiple lat, lon locations using CDO

Background: I am working with satellite observations of SST (time, lat, lon) from the CoRTAD SST dataset (a netCDF file). I have a set of (lon,lat) coordinates following the coastal contour of Portugal (referred to below as midshelf locations). I want to extract the SST time series at each of these midshelf locations, average them, and subtract the SST at the same latitudes but at a fixed longitude, to give a coastal SST index.
The midshelf lon,lats were first determined from a nautical chart and then linearly interpolated to the lon,lats of the CoRTAD grid.
How can this be done using CDO?
The midshelf lon,lats from the nautical chart are given below:
-8.000 43.860
-9.000 43.420
-9.350 43.220
-9.388 42.893
-9.000 42.067
-8.935 41.308
-9.000 40.692
-9.278 40.000
-9.324 39.550
-9.518 39.387
-9.777 38.883
-9.285 38.378
-8.909 38.293
-8.951 38.000
-8.965 37.953
-8.917 37.833
-8.913 37.667
-8.915 37.500
-8.975 37.333
-9.017 37.167
-9.045 37.000
So here is my attempt to answer the question as it was stated in the comments (i.e. you want an index formed by averaging the midshelf locations and then subtracting the SST at the same latitudes sampled at longitude 9E). I assume the locations are stored pair-wise, one pair per line, in a text file called "locations.txt", as in your question above. The loop part of the answer is adapted from one of this question's solutions.
# first loop over the pairs of coordinates in the text file (one pair per line)
n=0
while read -r lon lat; do
    # sample each mid-shelf point at its precise lon/lat
    cdo remapnn,"lon=${lon}/lat=${lat}" in.nc pt_${n}.nc
    # sample the same lat but at lon=9 (change as wanted)
    cdo remapnn,"lon=9/lat=${lat}" in.nc 9E_${n}.nc
    n=$((n+1))
done < locations.txt
# now take the ensemble average over the points.
cdo ensmean pt_*.nc mid_shelf_sst.nc
cdo ensmean 9E_*.nc mid_shelf_9E.nc
# and calculate the index
cdo sub mid_shelf_sst.nc mid_shelf_9E.nc SST_index.nc

Divide the testing set into subgroups, then make predictions on each subgroup separately

I have a dataset where the prediction target is the 'score' column. I'm wondering how I can divide the testing set into different subgroups, such as scores between 1 and 3, and then check the accuracy on each subgroup.
Now what I have is as follows:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = tree.DecisionTreeRegressor()
model.fit(X_train, y_train)
for i in (0, 1, 2, 3, 4):
    y_new = y_test[(y_test >= i) & (y_test <= i+1)]
    y_new_pred = model.predict(X_test)
    print(metrics.r2_score(y_new, y_new_pred))
However, my code did not work, and this is the traceback that I get:
ValueError: Found input variables with inconsistent numbers of samples: [14279, 55955]
I have tried the solution provided below, but it looks like for the full score range (0-5) the r^2 is 0.67, while for the subscore ranges (0-1, 1-2, 2-3, 3-4, 4-5) the r^2 values are significantly lower. Shouldn't some of the subscore r^2 values be higher than 0.67 and some of them lower?
Could anyone kindly let me know where I went wrong? Thanks a lot for all your help.
When you compute the metrics, you have to filter the predicted values as well (based on your subset condition).
Basically you are trying to compute
metrics.r2_score([1,3],[1,2,3,4,5])
which raises the error:
ValueError: Found input variables with inconsistent numbers of
samples: [2, 5]
Hence, my suggested solution would be:
model.fit(X_train, y_train)
# compute the predictions only once
y_pred = model.predict(X_test)
for i in (0, 1, 2, 3, 4):
    # compute the condition for the subset here
    subset = (y_test >= i) & (y_test <= i+1)
    print(metrics.r2_score(y_test[subset], y_pred[subset]))
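As to why the subgroup r^2 values come out lower than the full-range 0.67: r2_score compares the squared prediction errors against the variance of the true values within the evaluated subset, so restricting the score range shrinks that variance and drags r^2 down even when the predictions are equally good everywhere. A small synthetic sketch (hypothetical data, not your dataset) that reproduces the effect:
import numpy as np
from sklearn import metrics

rng = np.random.default_rng(0)
y_true = rng.uniform(0, 5, 10_000)                 # hypothetical scores in [0, 5]
y_pred = y_true + rng.normal(0, 0.5, y_true.size)  # same error size everywhere

print(metrics.r2_score(y_true, y_pred))            # high on the full range (~0.88)
for i in (0, 1, 2, 3, 4):
    subset = (y_true >= i) & (y_true <= i + 1)
    # much lower (even negative): same errors vs. much smaller within-bin variance
    print(i, metrics.r2_score(y_true[subset], y_pred[subset]))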

Cumsum function in Python

I have the below-mentioned dataset.
https://docs.google.com/spreadsheets/d/13GCAXHp5BU4vYU6PdX40wM-Jhp--LeRd9C5oUurbVY4/edit#gid=0
I want to find the cumulative sales values for the different stores in one column. For example, the cumulative sales value for store 2106 should be 176,849.
I'm using the following function:
df = df.groupby('storenumber')['sales'].cumsum()
but I am not getting the correct result.
Can someone help?
Here's what I did to solve this problem.
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv') # get data frame from csv file
You won't be able to run numerical operations on your data as it is, because the Sale (Dollars) column in df is not formatted as a numerical type. The following piece of code converts the data in the Sale (Dollars) and Suggested answer columns to type float by removing the dollar sign and the separating commas.
df[df.columns[2:]] = df[df.columns[2:]].replace(r'[\$,]', '', regex=True).astype(float)
Then, I used the following bit of code to get the cumulative value for each unique Store Number.
cum_sales_by_store_number = df.groupby('Store Number')['Sale (Dollars)'].agg(np.sum)
cum_sales_by_store_number = pd.DataFrame(cum_sales_by_store_number)
Output for cum_sales_by_store_number:
              Sale (Dollars)
Store Number
2106               176849.97
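Note that the code above gives one total per store; if you instead want a running cumulative sum down the column, the groupby().cumsum() from your question works once the column is numeric. A minimal sketch, assuming the same 'Store Number' and 'Sale (Dollars)' column names as above (the new column name is arbitrary):
import pandas as pd

df = pd.read_csv('data.csv')
# same cleaning as above: strip '$' and ',' so the sales column is numeric
df['Sale (Dollars)'] = df['Sale (Dollars)'].replace(r'[\$,]', '', regex=True).astype(float)

# running cumulative sum within each store, aligned row by row with df
df['Cumulative Sales'] = df.groupby('Store Number')['Sale (Dollars)'].cumsum()
print(df[['Store Number', 'Sale (Dollars)', 'Cumulative Sales']].head())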
I hope this answers your question. Happy coding!

Can we use a data generator for regression? (Keras, Python)

I want to augment my data using a data generator in Keras, as below:
datagen = ImageDataGenerator(
    featurewise_center=True,             # set input mean to 0 over the dataset
    samplewise_center=True,              # set each sample mean to 0
    featurewise_std_normalization=True,  # divide inputs by std of the dataset
    samplewise_std_normalization=True,   # divide each input by its std
    zca_whitening=True,                  # apply ZCA whitening
    rotation_range=0,                    # randomly rotate images in the range (degrees, 0 to 180)
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0,
    width_shift_range=0,                 # randomly shift images horizontally (fraction of total width)
    height_shift_range=0,                # randomly shift images vertically (fraction of total height)
    horizontal_flip=True,                # randomly flip images
    vertical_flip=True)                  # randomly flip images
But I use this network for regression, not classification, and I am not sure whether the data generator sets new output values or not. In a 0/1 classification problem the generator can flip or rotate the data without changing the output, but here the output should change along with the input. Does it do this?
Thanks in advance.
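As far as I know, ImageDataGenerator only transforms the inputs: the y array you pass to flow() is returned with each batch unchanged. So it is safe for regression only when your targets are invariant under the chosen transforms (a flipped image keeps the same scalar target); if the targets are spatial, e.g. coordinates, the generator will not adjust them for you. A minimal sketch with hypothetical x_train/y_train arrays:
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

# hypothetical data: 100 RGB images, one scalar regression target each
x_train = np.random.rand(100, 32, 32, 3)
y_train = np.random.rand(100)

datagen = ImageDataGenerator(horizontal_flip=True, vertical_flip=True)

# flow() augments x but passes y through untouched
for x_batch, y_batch in datagen.flow(x_train, y_train, batch_size=32):
    # y_batch is simply a (shuffled) subset of y_train, never transformed
    break  # take one batch just to inspect it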

Calculating the Gini Coefficient from LIS data (in Stata)

I need to calculate the Gini coefficient from disposable personal income data at LIS. According to a LIS training document, the Stata code to do this is:
di "** INCOME DISTRIBUTION II – Exercise 13 **"
program define bottop
qui sum ey [w=hweight*d4]
replace ey = .01*r(mean) if ey<.01*r(mean)
qui sum dpi [w=hweight*d4], de
replace ey = (10*r(p50)/(d4^.5)) if dpi>10*r(p50)
end
foreach file in $us00h $fi00h {
display "`file'"
use hweight d4 dpi if (!mi(dpi) & !(dpi==0)) using "`file'", clear
gen ey=dpi/(d4^0.5)
bottop
ineqdeco ey [w=hweight*d4]
}
I have simply copied and pasted this code from the training document. The snippets
qui sum ey [w=hweight*d4]
replace ey=0.01*r(mean) if ey<0.01*r(mean)
and
qui sum dpi [w=hweight*d4], de
replace ey=(10*r(p50)/(d4^0.5)) if dpi>10*r(p50)
are bottom and top coding, respectively.
When I tried to run this code, the variable hweight was not found. Does anyone know what the new name of hweight is at LIS? Or can anyone suggest how I might otherwise overcome this impasse?
I'm familiar with Stata, but the sophistication of this code is beyond my ken.
Much appreciated.
Based on the variable definition list at the LIS Documentation page, it looks like the variable is now called HWGT.
This is more of a second-best solution. The census of population provides income by brackets, so you can get the counts for every bracket, with a top-coded bracket as the last one. Use the median income value within each bracket, and then you can directly apply the formula for the Gini coefficient. It is second best because it approximates the individual-level data; a sketch of the bracket-based calculation is given below.
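A minimal sketch of that approximation in Python, with made-up bracket counts and median incomes (it applies the trapezoid rule to the Lorenz curve; brackets must be sorted by income):
# hypothetical brackets: (household count, median income within the bracket)
brackets = [(120, 5_000), (240, 15_000), (180, 30_000), (60, 75_000)]

counts = [c for c, _ in brackets]
incomes = [c * m for c, m in brackets]  # total income attributed to each bracket
total_pop, total_inc = sum(counts), sum(incomes)

# cumulative population and income shares (points on the Lorenz curve)
p, L = [0.0], [0.0]
for c, inc in zip(counts, incomes):
    p.append(p[-1] + c / total_pop)
    L.append(L[-1] + inc / total_inc)

# Gini = 1 - twice the area under the Lorenz curve (trapezoid rule)
gini = 1 - sum((p[k] - p[k-1]) * (L[k] + L[k-1]) for k in range(1, len(p)))
print(round(gini, 3))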
Why don't you try the fastgini command:
http://www.stata.com/statalist/archive/2007-02/msg00524.html
ssc install fastgini
fastgini income
return list
This should give you the Gini coefficient for the variable income.
This package also allows for weights. Type
help fastgini
for more information.