Serialize a pandas dataframe containing NaN fields before sending it as a response - django

I have a dataframe that has NaN fields in it, and I want to send this dataframe as a response. Because it has NaN fields I get this error:
ValueError: Out of range float values are not JSON compliant
I don't want to drop the fields or fill them with a placeholder character, and the default response structure is ideal for my application.
Here is my views.py
...
forecast = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]
forecast['actual_value'] = df['y']  # <- NaN fields are added here
forecast.rename(
    columns={
        'ds': 'date',
        'yhat': 'predictions',
        'yhat_lower': 'lower_bound',
        'yhat_upper': 'higher_bound'
    }, inplace=True
)
context = {
    'detail': forecast
}
return Response(context)
Dataframe,
date predictions lower_bound higher_bound actual_value
0 2022-07-23 06:31:41.362011 3.832143 -3.256209 10.358063 1.0
1 2022-07-23 06:31:50.437211 4.169004 -2.903518 10.566005 7.0
2 2022-07-28 14:20:05.000000 12.085815 5.267806 18.270929 20.0
...
16 2022-08-09 15:07:23.000000 105.655997 99.017424 112.419991 NaN
17 2022-08-10 15:07:23.000000 115.347283 108.526287 122.152684 NaN
I'm hoping to find a way to send the dataframe as a response.

You could try using the fillna and replace methods to get rid of those NaN values.
Adding something like this should work, as None values are JSON compliant:
forecast = forecast.fillna(np.nan).replace([np.nan], [None])
Using replace alone can be enough, but calling fillna first prevents errors if you also have NaT values, for example.
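To see why this works, here is a standalone sketch with made-up numbers (not the question's Prophet output): NaN breaks json.dumps, while None serialises as null.

```python
import json

import numpy as np
import pandas as pd

# Made-up frame standing in for the question's forecast dataframe.
forecast = pd.DataFrame({
    'predictions': [3.832143, 4.169004],
    'actual_value': [1.0, np.nan],
})

# fillna(np.nan) normalises NaT and friends to NaN; replace swaps NaN for None.
forecast = forecast.fillna(np.nan).replace([np.nan], [None])

payload = forecast.to_dict(orient='records')
print(json.dumps(payload))  # the NaN is now null in the JSON string
```

Note that the replace step changes the column dtype to object, which is fine for serialization but means you should do it as the last step before building the response.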

Related

Pandas regex replacement when there is no match

I'm using pandas.Series.str.replace to extract numbers from strings (it's data that has been scraped from #WPWeather) and have got to the point where I've extracted all the fields into a DataFrame like this...
df.head()
Out[48]:
temp pressure relative_humidity \
created_at
2019-12-13 10:19:13 5.2\xc2\xbaC, 975.4mb, 91.3%.
2019-12-12 10:19:07 2\xc2\xbaC, 990.3mb, 96.9%.
2019-12-11 10:19:07 4.2\xc2\xbaC, 1000.8mb, 85.7%.
2019-12-10 10:19:00 6.3\xc2\xbaC, 1008.5mb, 94.4%.
2019-12-09 10:18:51 5.4\xc2\xbaC, 1006.7mb, 68.5%.
last_24_max_temp last_24_min_temp rain sunshine
created_at
2019-12-13 10:19:13 7\xc2\xbaC, 2\xc2\xbaC, 9.5mm, 0
2019-12-12 10:19:07 6\xc2\xbaC, 1.5\xc2\xbaC, 0.9mm.' NaN
2019-12-11 10:19:07 11.7\xc2\xbaC, 2.2\xc2\xbaC, 14.1mm.' NaN
2019-12-10 10:19:00 6.5\xc2\xbaC, 1.9\xc2\xbaC, 1.1mm.' NaN
2019-12-09 10:18:51 8.5\xc2\xbaC, 5.2\xc2\xbaC, 1.5mm, 1.9
I'm trying to use regex's to extract the numerical values using...
pd.to_numeric(df['temp'].str.replace(r'(^-?\d+(?:\.\d+)?)(.*)', r'\1', regex=True))
...and it works well, but I've hit an instance where one of the temperature fields doesn't have a value and is simply \xc2\xbaC,. As a consequence there is nothing matched in the first group to use in r'\1', and when it gets to converting to numeric it fails with...
pandas/_libs/lib.pyx in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "\xc2\xbaC," at position 120
How do I replace non-matches with something sane such as blank so that when I then call pd.to_numeric() it will convert to NaN?
One idea is to change the string passed to replace, so that rows with no numeric value become missing values:
df['temp'] = pd.to_numeric(df['temp'].str.replace(r'\xc2\xbaC,', '', regex=True))
print (df)
temp pressure relative_humidity
created_at
2019-12-13 10:19:13 5.2 975.4mb, 91.3%.
2019-12-12 10:19:07 2.0 990.3mb, 96.9%.
2019-12-11 10:19:07 4.2 1000.8mb, 85.7%.
2019-12-10 10:19:00 6.3 1008.5mb, 94.4%.
2019-12-09 10:18:51 5.4 1006.7mb, 68.5%.
Your solution should be changed by adding the parameter errors='coerce' to to_numeric, which converts non-numeric values to missing values:
df['temp'] = pd.to_numeric(
    df['temp'].str.replace(r'(^-?\d+(?:\.\d+)?)(.*)', r'\1', regex=True),
    errors='coerce'
)
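A runnable sketch of that fix, with a small made-up series mimicking the scraped strings (the middle entry has no number, like the question's position 120; the degree sign is written as \xba directly rather than as the UTF-8 byte pair shown in the question):

```python
import pandas as pd

# Made-up sample: the middle entry has no numeric part.
temp = pd.Series(['5.2\xbaC,', '\xbaC,', '-1.5\xbaC,'])

# The regex leaves non-matching strings untouched, so to_numeric would raise;
# errors='coerce' turns anything unparseable into NaN instead.
nums = pd.to_numeric(
    temp.str.replace(r'(^-?\d+(?:\.\d+)?)(.*)', r'\1', regex=True),
    errors='coerce',
)
print(nums)
```

The non-matching entry comes through as NaN, which is exactly the "something sane" the question asks for.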

Reading Time Series from netCDF with python

I'm trying to create a time series from a netCDF file (accessed via a Thredds server) with python. The code I use seems correct, but the values of the variable I am reading are 'masked'. I'm new to python and I'm not familiar with the formats. Any idea of how I can read the data?
This is the code I use:
import netCDF4
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
from datetime import datetime, timedelta #
dayFile = datetime.now() - timedelta(days=1)
dayFile = dayFile.strftime("%Y%m%d")
url='http://nomads.ncep.noaa.gov:9090/dods/nam/nam%s/nam1hr_00z' %(dayFile)
# NetCDF4-Python can open OPeNDAP dataset just like a local NetCDF file
nc = netCDF4.Dataset(url)
varsInFile = nc.variables.keys()
lat = nc.variables['lat'][:]
lon = nc.variables['lon'][:]
time_var = nc.variables['time']
dtime = netCDF4.num2date(time_var[:],time_var.units)
first = netCDF4.num2date(time_var[0],time_var.units)
last = netCDF4.num2date(time_var[-1],time_var.units)
print first.strftime('%Y-%b-%d %H:%M')
print last.strftime('%Y-%b-%d %H:%M')
# determine what longitude convention is being used
print lon.min(),lon.max()
# Specify desired station time series location
# note we add 360 because of the lon convention in this dataset
#lati = 36.605; loni = -121.85899 + 360. # west of Pacific Grove, CA
lati = 41.4; loni = -100.8 +360.0 # Georges Bank
# Function to find index to nearest point
def near(array, value):
    idx = (abs(array - value)).argmin()
    return idx
# Find nearest point to desired location (no interpolation)
ix = near(lon, loni)
iy = near(lat, lati)
print ix,iy
# Extract desired times.
# 1. Select -+some days around the current time:
start = netCDF4.num2date(time_var[0],time_var.units)
stop = netCDF4.num2date(time_var[-1],time_var.units)
time_var = nc.variables['time']
datetime = netCDF4.num2date(time_var[:],time_var.units)
istart = netCDF4.date2index(start,time_var,select='nearest')
istop = netCDF4.date2index(stop,time_var,select='nearest')
print istart,istop
# Get all time records of variable [vname] at indices [iy,ix]
vname = 'dswrfsfc'
var = nc.variables[vname]
hs = var[istart:istop,iy,ix]
tim = dtime[istart:istop]
# Create Pandas time series object
ts = pd.Series(hs,index=tim,name=vname)
The var data are not read as I expected, apparently because data is masked:
>>> hs
masked_array(data = [-- -- -- ..., -- -- --],
mask = [ True True True ..., True True True],
fill_value = 9.999e+20)
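That masked_array is the key to the NaNs further down: when pandas builds a Series from a masked array, masked entries become NaN. A minimal standalone sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Fully masked array, like the hs extracted above.
hs = np.ma.masked_array(
    data=[1.0, 2.0, 3.0],
    mask=[True, True, True],
    fill_value=9.999e+20,
)

# pandas replaces masked entries with NaN when constructing the Series.
ts = pd.Series(hs)
print(ts.isna().all())  # True
```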
The variable name and the time series are correct, as is the rest of the script. The only thing that doesn't work is the variable data retrieved. This is the time series I get:
>>> ts
2016-10-25 00:00:00.000000 NaN
2016-10-25 01:00:00.000000 NaN
2016-10-25 02:00:00.000006 NaN
2016-10-25 03:00:00.000000 NaN
2016-10-25 04:00:00.000000 NaN
... ... ... ... ...
2016-10-26 10:00:00.000000 NaN
2016-10-26 11:00:00.000006 NaN
Name: dswrfsfc, dtype: float32
Any help will be appreciated!
Hmm, this code looks familiar. ;-)
You are getting NaNs because the NAM model you are trying to access now uses longitude in the range [-180, 180] instead of the range [0, 360]. So if you request loni = -100.8 instead of loni = -100.8 +360.0, I believe your code will return non-NaN values.
It's worth noting, however, that the task of extracting time series from multidimensional gridded data is now much easier with xarray, because you can simply select a dataset closest to a lon,lat point and then plot any variable. The data only gets loaded when you need it, not when you extract the dataset object. So basically you now only need:
import xarray as xr
ds = xr.open_dataset(url) # NetCDF or OPeNDAP URL
lati = 41.4; loni = -100.8 # Georges Bank
# Extract a dataset closest to specified point
dsloc = ds.sel(lon=loni, lat=lati, method='nearest')
# select a variable to plot
dsloc['dswrfsfc'].plot()
Full notebook here: http://nbviewer.jupyter.org/gist/rsignell-usgs/d55b37c6253f27c53ef0731b610b81b4
I checked your approach with xarray; it works great for extracting solar radiation data! I can add that the first point is not defined (NaN) because the model starts calculating there, so there is no accumulated radiation data yet (needed to calculate hourly global radiation). That is why it is masked.
Something everyone overlooked is that the output is not correct. It does look OK (at noon: sunshine; at midnight: 0, dark), but the day length is not correct! I checked it for 52° latitude north and 5.6° longitude east (November) and the day length is at least 2 hours too long! (The NOAA Panoply viewer for netCDF databases gives similar results.)

Grab rows between two Datetime and avoid iterating

I use Pandas to retrieve a lot of Data via an SQL query (from Hive). I have a big DataFrame now:
market_pings = pandas.read_sql_query(query, engine)
market_pings['event_time'] = pandas.to_datetime(market_pings['event_time'])
I have calculated time-delta periods: if something interesting happens within the timeline of these events, I want only the rows of the market_pings DataFrame for that time interval.
To grab DataFrame rows where a column has certain values there is a cool trick:
valuelist = ['value1', 'value2', 'value3']
df = df[df.column.isin(valuelist)]
Does anyone have an idea how to do this for time periods, so that I get the events of certain times from the market_pings DataFrame without direct Iteration (row by row)?
I can build a list of periods (1s accuracy) like:
2015-08-03 19:19:47
2015-08-03 19:20:00
But this means my value list becomes a tuple of datetimes and I somehow have to compare dates.
You can create a list of timestamps as your value list and perform the operation you intend:
time_list = [pd.Timestamp('2015-08-03 19:19:47'), pd.Timestamp('2015-08-03 19:20:00')]
One caveat with between_time() is that the index has to be a date or time index;
if it is not, you can set one with set_index().
mydf = pd.Series(np.random.randn(4), time_list)
mydf
Out[123]:
2015-08-03 19:19:47 0.632509
2015-08-03 19:20:00 -0.234267
2015-08-03 19:19:48 0.159056
2015-08-03 21:20:00 -0.842017
dtype: float64
mydf.between_time(start_time=pd.Timestamp('2015-08-03 19:19:47'),
                  end_time=pd.Timestamp('2015-08-03 19:20:00'),
                  include_end=False)
Out[124]:
2015-08-03 19:19:47 0.632509
2015-08-03 19:19:48 0.159056
dtype: float64
mydf.between_time(start_time=pd.Timestamp('2015-08-03 19:19:47'),
                  end_time=pd.Timestamp('2015-08-03 19:20:00'),
                  include_end=False, include_start=False)
Out[125]:
2015-08-03 19:19:48 0.159056
dtype: float64
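If the timestamps live in a column (as in market_pings) rather than the index, a boolean mask avoids both iteration and set_index(); Series.between is inclusive on both ends by default. A sketch with made-up rows:

```python
import pandas as pd

# Hypothetical frame standing in for market_pings from the question.
market_pings = pd.DataFrame({
    'event_time': pd.to_datetime([
        '2015-08-03 19:19:46', '2015-08-03 19:19:47',
        '2015-08-03 19:19:55', '2015-08-03 19:20:01',
    ]),
    'value': [1, 2, 3, 4],
})

start = pd.Timestamp('2015-08-03 19:19:47')
stop = pd.Timestamp('2015-08-03 19:20:00')

# Vectorised selection: a boolean mask over the column picks the rows
# in the closed interval [start, stop], no row-by-row iteration needed.
hits = market_pings[market_pings['event_time'].between(start, stop)]
print(hits)
```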

Pandas Interpolate Returning NaNs

I'm trying to do basic interpolation of position data at 60hz (~16ms) intervals. When I try to use pandas 0.14 interpolation over the dataframe, it tells me I only have NaNs in my data set (not true). When I try to run it over individual series pulled from the dataframe, it returns the same series without the NaNs filled in. I've tried setting the indices to integers, using different methods, fiddling with the axis and limit parameters of the interpolation function - no dice. What am I doing wrong?
df.head(5) :
x y ms
0 20.5815 14.1821 333.3333
1 NaN NaN 350
2 20.6112 14.2013 366.6667
3 NaN NaN 383.3333
4 20.5349 14.2232 400
df = df.set_index(df.ms) # set indices to milliseconds
When I try running
df.interpolate(method='values')
I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-462-cb0f1f01eb84> in <module>()
12
13
---> 14 df.interpolate(method='values')
15
16
/Users/jsb/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in interpolate(self, method, axis, limit, inplace, downcast, **kwargs)
2511
2512 if self._data.get_dtype_counts().get('object') == len(self.T):
-> 2513 raise TypeError("Cannot interpolate with all NaNs.")
2514
2515 # create/use the index
TypeError: Cannot interpolate with all NaNs.
I've also tried running it over individual series, which only returns what I put in:
temp = df.x
temp.interpolate(method='values')
333.333333 20.5815
350.000000 NaN
366.666667 20.6112
383.333333 NaN
400.000000 20.5349
Name: x, dtype: object
EDIT :
Props to Jeff for inspiring the solution.
Adding:
df[['x','y','ms']] = df[['x','y','ms']].astype(float)
before
df.interpolate(method='values')
interpolation did the trick.
Based on your edit, with props to Jeff for inspiring the solution:
Adding:
df = df.astype(float)
before
df.interpolate(method='values')
interpolation did the trick for me as well. Unless you're sub-selecting a column set, you don't need to specify the columns.
I'm not able to reproduce the error (see below for a copy/paste-able example). Can you make sure that the data you show is actually representative of your data?
In [137]: from StringIO import StringIO
In [138]: df = pd.read_csv(StringIO(""" x y ms
...: 0 20.5815 14.1821 333.3333
...: 1 NaN NaN 350
...: 2 20.6112 14.2013 366.6667
...: 3 NaN NaN 383.3333
...: 4 20.5349 14.2232 400"""), delim_whitespace=True)
In [140]: df = df.set_index(df.ms)
In [142]: df.interpolate(method='values')
Out[142]:
x y ms
ms
333.3333 20.58150 14.18210 333.3333
350.0000 20.59635 14.19170 350.0000
366.6667 20.61120 14.20130 366.6667
383.3333 20.57305 14.21225 383.3333
400.0000 20.53490 14.22320 400.0000
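To reproduce the original failure mode and the fix end to end, here is a self-contained sketch using the question's values, with object dtype forced deliberately to mimic the state that made interpolate() raise:

```python
import io

import pandas as pd

raw = """x y ms
20.5815 14.1821 333.3333
NaN NaN 350
20.6112 14.2013 366.6667
"""
# Force object dtype to mimic the frame that made interpolate() fail.
df = pd.read_csv(io.StringIO(raw), sep=r'\s+').astype(object)

# The fix: cast to float before interpolating on the index values.
df = df.astype(float)
df = df.set_index(df['ms'])
filled = df.interpolate(method='values')
print(filled)
```

With float columns, method='values' interpolates linearly against the ms index, filling the 350 ms row with the midpoint values shown in the answer above.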

Python Pandas: Unexplainable NaN values

Is there some error that I am doing or is it a fault within pandas or perhaps Quandl?
I'm pretty sure the problem is in the following line:
quandl_gold_fridays['Round'] = quandl_gold['Close'].apply(lambda x: int(float(x)/23))
Notice that you used quandl_gold on the right-hand side instead of quandl_gold_fridays. The date corresponding with your NaN is 2014-04-18, which was Good Friday (i.e. markets closed). There would be no corresponding value in quandl_gold on that date for the lambda to use, so it would be passed NaN.
To illustrate, try adding a cell with the following code:
import pandas as pd
x = pd.merge(left=quandl_gold.loc[:, ['Close']],
             right=quandl_gold_fridays.loc[:, ['Close', 'Round']],
             left_index=True,
             right_index=True,
             how='right')
x.tail(10)
You'll notice the NaN in the "Close_x" column.
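To see the mechanism in isolation, here is a sketch with made-up prices and frames standing in for the Quandl data, where 2014-04-18 (Good Friday) exists only in the Fridays frame:

```python
import pandas as pd

# Made-up stand-ins: the market was closed on Good Friday 2014-04-18,
# so quandl_gold has no row for it, while the Fridays frame does.
quandl_gold = pd.DataFrame(
    {'Close': [1318.0, 1303.0]},
    index=pd.to_datetime(['2014-04-11', '2014-04-25']),
)
quandl_gold_fridays = pd.DataFrame(
    {'Close': [1318.0, 1295.0, 1303.0],
     'Round': [57, 56, 56]},
    index=pd.to_datetime(['2014-04-11', '2014-04-18', '2014-04-25']),
)

# how='right' keeps every Friday; dates missing from the left frame
# get NaN in Close_x -- the same NaN that was fed to the lambda.
x = pd.merge(left=quandl_gold, right=quandl_gold_fridays,
             left_index=True, right_index=True, how='right')
print(x)
```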