Pandas Interpolate Returning NaNs - python-2.7

I'm trying to do basic interpolation of position data at 60hz (~16ms) intervals. When I try to use pandas 0.14 interpolation over the dataframe, it tells me I only have NaNs in my data set (not true). When I try to run it over individual series pulled from the dataframe, it returns the same series without the NaNs filled in. I've tried setting the indices to integers, using different methods, fiddling with the axis and limit parameters of the interpolation function - no dice. What am I doing wrong?
df.head(5) :
x y ms
0 20.5815 14.1821 333.3333
1 NaN NaN 350
2 20.6112 14.2013 366.6667
3 NaN NaN 383.3333
4 20.5349 14.2232 400
df = df.set_index(df.ms) # set indices to milliseconds
When I try running
df.interpolate(method='values')
I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-462-cb0f1f01eb84> in <module>()
12
13
---> 14 df.interpolate(method='values')
15
16
/Users/jsb/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in interpolate(self, method, axis, limit, inplace, downcast, **kwargs)
2511
2512 if self._data.get_dtype_counts().get('object') == len(self.T):
-> 2513 raise TypeError("Cannot interpolate with all NaNs.")
2514
2515 # create/use the index
TypeError: Cannot interpolate with all NaNs.
I've also tried running over individual series, which only return what I put in:
temp = df.x
temp.interpolate(method='values')
333.333333 20.5815
350.000000 NaN
366.666667 20.6112
383.333333 NaN
400.000000 20.5349 Name: x, dtype: object
EDIT :
Props to Jeff for inspiring the solution.
Adding:
df[['x','y','ms']] = df[['x','y','ms']].astype(float)
before
df.interpolate(method='values')
interpolation did the trick.

Based on your edit with props to Jeff for inspiring the solution.
Adding:
df = df.astype(float)
before
df.interpolate(method='values')
interpolation did the trick for me as well. Unless you're sub-selecting a column set, you don't need to specify the columns.

I'm not able to to reproduce the error (see below for a copy/paste-able example), can you make sure the the data you show is actually representative of your data?
In [137]: from StringIO import StringIO
In [138]: df = pd.read_csv(StringIO(""" x y ms
...: 0 20.5815 14.1821 333.3333
...: 1 NaN NaN 350
...: 2 20.6112 14.2013 366.6667
...: 3 NaN NaN 383.3333
...: 4 20.5349 14.2232 400"""), delim_whitespace=True)
In [140]: df = df.set_index(df.ms)
In [142]: df.interpolate(method='values')
Out[142]:
x y ms
ms
333.3333 20.58150 14.18210 333.3333
350.0000 20.59635 14.19170 350.0000
366.6667 20.61120 14.20130 366.6667
383.3333 20.57305 14.21225 383.3333
400.0000 20.53490 14.22320 400.0000

Related

I am trying to run Dickey-Fuller test in statsmodels in Python but getting error

I am trying to run Dickey-Fuller test in statsmodels in Python but getting error P
Running from python 2.7 & Pandas version 0.19.2. Dataset is from Github and imported the same
enter code here
from statsmodels.tsa.stattools import adfuller
def test_stationarity(timeseries):
print 'Results of Dickey-Fuller Test:'
dftest = ts.adfuller(timeseries, autolag='AIC' )
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
print dfoutput
test_stationarity(tr)
Gives me following error :
Results of Dickey-Fuller Test:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-10ab4b87e558> in <module>()
----> 1 test_stationarity(tr)
<ipython-input-14-d779e1ed35b3> in test_stationarity(timeseries)
19 #Perform Dickey-Fuller test:
20 print 'Results of Dickey-Fuller Test:'
---> 21 dftest = ts.adfuller(timeseries, autolag='AIC' )
22 #dftest = adfuller(timeseries, autolag='AIC')
23 dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
C:\Users\SONY\Anaconda2\lib\site-packages\statsmodels\tsa\stattools.pyc in adfuller(x, maxlag, regression, autolag, store, regresults)
209
210 xdiff = np.diff(x)
--> 211 xdall = lagmat(xdiff[:, None], maxlag, trim='both', original='in')
212 nobs = xdall.shape[0] # pylint: disable=E1103
213
C:\Users\SONY\Anaconda2\lib\site-packages\statsmodels\tsa\tsatools.pyc in lagmat(x, maxlag, trim, original)
322 if x.ndim == 1:
323 x = x[:,None]
--> 324 nobs, nvar = x.shape
325 if original in ['ex','sep']:
326 dropidx = nvar
ValueError: too many values to unpack
tr must be a 1d array-like, as you can see here. I don't know what is tr in your case. Assuming that you defined tr as the dataframe that contains the time serie's data, you should do something like this:
tr = tr.iloc[:,0].values
Then adfuller will be able to read the data.
just change the line as:
dftest = adfuller(timeseries.iloc[:,0].values, autolag='AIC' )
It will work. adfuller requires a 1D array list. In your case you have a dataframe. Therefore mention the column or edit the line as mentioned above.
I am assuming since you are using the Dickey-Fuller test .you want to maintain the timeseries i.e date time column as index.So in order to do that.
tr=tr.set_index('Month') #I am assuming here the time series column name is Month
ts = tr['othercoulumnname'] #Just use the other column name here it might be count or anything
I hope this helps.

Converting datetime to pandas index

My pandas dataframe is structured as follows:
date tag
0 2015-07-30 19:19:35-04:00 E7RG6
1 2016-01-27 08:20:01-05:00 ER57G
2 2015-11-15 23:32:16-05:00 EQW7G
3 2016-07-12 00:01:11-04:00 ERV7G
4 2016-02-14 00:35:21-05:00 EQW7G
5 2016-03-01 00:08:59-05:00 EQW7G
6 2015-06-19 07:15:06-04:00 ER57G
7 2016-09-08 18:17:53-04:00 ER5TT
8 2016-09-03 01:53:45-04:00 EQW7G
9 2015-11-30 09:31:02-05:00 ER57G
10 2016-03-03 22:28:26-05:00 ES5TG
11 2016-02-11 10:39:24-05:00 E5P7G
12 2015-03-16 07:18:47-04:00 ER57G
...
[11015 rows x 2 columns]
date datetime64[ns, America/New_York]
tag object
dtype: object
I'm attempting to set the column 'date' as the index:
df = df.set_index(pd.DatetimeIndex(df['date']))
which yields the following error (using pandas 0.19)
File "pandas/tslib.pyx", line 3753, in pandas.tslib.tz_localize_to_utc (pandas/tslib.c:64516)
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 01:38:12'), try using the 'ambiguous' argument
I've consulted this, but I'm still unable to work through this error. For example,
df = df.set_index(pd.DatetimeIndex(df['date']), ambiguous='infer')
yields:
File "pandas/tslib.pyx", line 3703, in pandas.tslib.tz_localize_to_utc (pandas/tslib.c:63553)
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2015-11-01 01:38:12 asthere are no repeated times
Any advice on how to convert the datetime column to the index would be greatly appreciated.
If your dtype for a column is already datetime then you can just call set_index without the need to try to construct a DatetimeIndex from the column:
df.set_index(df['date'], inplace=True)
should just work, the dtype for the index is sniffed out so there is no need to construct an index object from the Series/column here.

Looping through file with .ix and .isin

My original data looks like this:
SUBBASIN HRU HRU_SLP OV_N
1 1 0.016155144 0.15
1 2 0.015563287 0.14
2 1 0.010589782 0.15
2 2 0.011574839 0.14
3 1 0.013865396 0.15
3 2 0.01744597 0.15
3 3 0.018983217 0.14
3 4 0.013890315 0.05
3 5 0.011792533 0.05
I need to modify value of OV_N for each SUBBASIN number:
hru = pd.read_csv('hru.csv')
for i in hru.OV_N:
hru.ix[hru.SUBBASIN.isin([76,65,64,72,81,84,60,46,37,1,2]), 'OV_N'] = i*(1+df21.value[12])
hru.ix[hru.SUBBASIN.isin([80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]), 'OV_N'] = i*(1+df23.value[12])
hru.ix[hru.SUBBASIN.isin([85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,49,29,22,24,25,9,10]), 'OV_N'] = i*(1+df56.value[12])
hru.ix[hru.SUBBASIN.isin([92,88,95,94,93]), 'OV_N'] = i*(1+df58.value[12])
where df21.value[12] is a value from a txt file
The code results in an infinite value of OV_N for all subbasins, so I assume that looping through a file goes multiple times, but I can't find a mistake and this code was working before with different numbers of subbasins.
It is generally better not to loop and index over rows in a pandas DataFrame. Transforming the DataFrame by column operations is the more pandasnic approach. A pandas DataFrame can be thought of as a zipped combination of pandas Series: each column is its own pandas Series – all sharing the same index. Operations can be applied to one or more pandas Series to create a new Series that shares the same index. Operations can also be applied to combine a Series with one dimensional numpy array to create a new Series. It is helpful to understand pandas indexing – however this answer will just use sequential integer indexing.
To modify the value of OV_N for each SUBBASIN number:
Initialize the hru DataFrame by reading it in from the hru.csv as in the original question. Here we initialize it with the data given in the question.
import numpy as np
import pandas as pd
hru = pd.DataFrame({
'SUBBASIN':[1,1,2,2,3,3,3,3,3],
'HRU':[1,2,1,2,1,2,3,4,5],
'HRU_SLP':[0.016155144,0.015563287,0.010589782,0.011574839,0.013865396,0.01744597,0.018983217,0.013890315,0.011792533],
'OV_N':[0.15,0.14,0.15,0.14,0.15,0.15,0.14,0.05,0.05]})
Create one separate pandas Series that gathers and stores all the values from the various DataFrames, i.e. df21, df23, df56, and df58, into one place. This will be used to look up values by index. Let’s call it subbasin_multiplier_ds. Let’s respectively assume values of 21, 23, 56, and 58 were read from the txt file. Do replace these with the real values read in from the txt file.
subbasin_multiplier_ds=pd.Series([21]*96)
subbasin_multiplier_ds[80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,
34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12] = 23
subbasin_multiplier_ds[85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,
86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,
49,29,22,24,25,9,10] = 56
subbasin_multiplier_ds[92,88,95,94,93] = 58
Replace OV_N in hru DataFrame based on columns in the DataFrame and a lookup in subbasin_multiplier_ds by index.
hru['OV_N'] = hru['OV_N'] * (1 + subbasin_multiplier_ds[hru['SUBBASIN']].values)
A numpy array is created by .values above so expected results are achieved. If you want to experiment with removing values give it a try to see what happens.

Pandas 0.17.1 error subtracting datetime64[ns] cols if dataframe is empty

I just updated from pandas 0.16.2 to 0.17.1 and am now getting this error when executing existing code that worked fine in 0.16.2
TypeError: ufunc substract cannot use operands with types dtype('<M8[ns]') and dtype('O')
I'm trying to subtract two datetime64 cols on a subset of a dataframe given certain criteria. The error only seems to happen when the subset is empty. In 0.16.2 the below sample code returns a new column D with NaN values, but in 0.17.1 I get the above error.
Any ideas? I need to get this code working again and was hoping not to downgrade back to 0.16.2
df = pd.DataFrame({'A': [1,2], 'B': ['2016-02-11 05:15:34', '2016-02-11 04:04:54'], 'C': ['2016-02-11 14:08:02', '2016-02-12 01:58:51']})
df['B'] = pd.to_datetime(df['B'])
df['C'] = pd.to_datetime(df['C'])
ix = df['A']==3
df.loc[ix, 'D'] = (df[ix]['C'] - df[ix]['B']) / np.timedelta64(1, 's')
df returned in 0.16.2:
A B C D
0 1 2016-02-11 05:15:34 2016-02-11 14:08:02 NaN
1 2 2016-02-11 04:04:54 2016-02-12 01:58:51 NaN

Python Pandas: Unexplainable NaN values

Is there some error that I am doing or is it a fault within pandas or perhaps Quandl?
I'm pretty sure the problem is in the following line:
quandl_gold_fridays['Round'] = quandl_gold['Close'].apply(lambda x: int(float(x)/23))
Notice that you used quandl_gold on the right-hand side instead of quandl_gold_fridays. The date corresponding with your NaN is 2014-04-18, which was Good Friday (i.e. markets closed). There would be no corresponding value in quandl_gold on that date for the lambda to use, so it would be passed NaN.
To illustrate, try adding a cell with the following code:
import pandas as pd
x = pd.merge(left=quandl_gold.loc[:, ['Close']],
right=quandl_gold_fridays.loc[:, ['Close','Round']],
left_index=True,
right_index=True,
how='right')
x.tail(10)
You'll notice the NaN in the "Close_x" column.