Converting datetime to pandas index - python-2.7

My pandas dataframe is structured as follows:
date tag
0 2015-07-30 19:19:35-04:00 E7RG6
1 2016-01-27 08:20:01-05:00 ER57G
2 2015-11-15 23:32:16-05:00 EQW7G
3 2016-07-12 00:01:11-04:00 ERV7G
4 2016-02-14 00:35:21-05:00 EQW7G
5 2016-03-01 00:08:59-05:00 EQW7G
6 2015-06-19 07:15:06-04:00 ER57G
7 2016-09-08 18:17:53-04:00 ER5TT
8 2016-09-03 01:53:45-04:00 EQW7G
9 2015-11-30 09:31:02-05:00 ER57G
10 2016-03-03 22:28:26-05:00 ES5TG
11 2016-02-11 10:39:24-05:00 E5P7G
12 2015-03-16 07:18:47-04:00 ER57G
...
[11015 rows x 2 columns]
date datetime64[ns, America/New_York]
tag object
dtype: object
I'm attempting to set the column 'date' as the index:
df = df.set_index(pd.DatetimeIndex(df['date']))
which yields the following error (using pandas 0.19):
File "pandas/tslib.pyx", line 3753, in pandas.tslib.tz_localize_to_utc (pandas/tslib.c:64516)
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 01:38:12'), try using the 'ambiguous' argument
I've consulted this, but I'm still unable to work through this error. For example,
df = df.set_index(pd.DatetimeIndex(df['date']), ambiguous='infer')
yields:
File "pandas/tslib.pyx", line 3703, in pandas.tslib.tz_localize_to_utc (pandas/tslib.c:63553)
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2015-11-01 01:38:12 as there are no repeated times
Any advice on how to convert the datetime column to the index would be greatly appreciated.

If the dtype of the column is already datetime, you can call set_index directly without constructing a DatetimeIndex from the column:
df.set_index(df['date'], inplace=True)
This should just work: the dtype for the index is inferred from the column, so there is no need to build an index object from the Series/column here. Constructing a new DatetimeIndex appears to force a re-localization of the already tz-aware values, which is what raises the AmbiguousTimeError.
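A minimal runnable sketch with two made-up rows (tz-aware, mirroring the question's structure):
import pandas as pd
# Hypothetical data; the values are illustrative only
df = pd.DataFrame({
    'date': pd.to_datetime(['2015-07-30 19:19:35',
                            '2016-01-27 08:20:01']).tz_localize('America/New_York'),
    'tag': ['E7RG6', 'ER57G'],
})
df = df.set_index('date')  # passing the column name works too
print(df.index.dtype)      # datetime64[ns, America/New_York]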


Is there a way to separate alphabetic characters from integers in a string in an Excel sheet using pandas? [duplicate]

This question already has answers here:
How to split a column into alphabetic values and numeric values from a column in a Pandas dataframe?
(4 answers)
Closed 3 years ago.
I'm working on a company project; the team collected data and put it in an Excel sheet. They want me to separate the integers from the alphabetic characters in the Barcode_Number column using regex. Is there a way I can do that for all the values under the Barcode_Number column?
import numpy as np
import pandas as pd
import re
data = pd.read_excel(r'C:\Users\yanga\Gaussian\SEC - 6. Yanga Deliverables\Transmission\Raw\3000_2- processed.xlsx')
data.head()
# Extract the column you want to work with
df = pd.DataFrame(data, columns=['Barcode_Number'])
# Identify the null values
df.isnull().sum()
# Remove rows where all values are null
df.dropna(how='all', inplace=True)
# Select cells that start with a non-digit character
df1 = df[df['Barcode_Number'].str.contains(r'^\D', na=False)]
For example, if I have this list of values under the Barcode_Number column:
Barcode_Number
'VQA435'
'KSR436'
'LAR437'
'ARB438'
and I want an output to be like this:
'VQA', '435'
'KSR', '436'
'LAR', '437'
'ARB', '438'
import pandas as pd
df = pd.read_csv(filename)
df[["Code", "Number"]] = df["Barcode_Number"].str.extract(r"([A-Z]+)([0-9]+)")
print(df)
Output:
Barcode_Number Code Number
0 VQA435 VQA 435
1 KSR436 KSR 436
2 LAR437 LAR 437
3 ARB438 ARB 438
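If some values might not match the pattern, str.extract fills those rows with NaN. A hedged variant using named capture groups (the group names Code and Number are just illustrative) makes the intent explicit:
import pandas as pd
# Inline sketch data; 'BAD' is a made-up value with no digits
df = pd.DataFrame({'Barcode_Number': ['VQA435', 'KSR436', 'BAD', 'ARB438']})
# Named groups become the new column names; non-matching rows get NaN
parts = df['Barcode_Number'].str.extract(r'(?P<Code>[A-Z]+)(?P<Number>[0-9]+)')
print(df.join(parts))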

Looping through file with .ix and .isin

My original data looks like this:
SUBBASIN HRU HRU_SLP OV_N
1 1 0.016155144 0.15
1 2 0.015563287 0.14
2 1 0.010589782 0.15
2 2 0.011574839 0.14
3 1 0.013865396 0.15
3 2 0.01744597 0.15
3 3 0.018983217 0.14
3 4 0.013890315 0.05
3 5 0.011792533 0.05
I need to modify the value of OV_N for each SUBBASIN number:
hru = pd.read_csv('hru.csv')
for i in hru.OV_N:
    hru.ix[hru.SUBBASIN.isin([76,65,64,72,81,84,60,46,37,1,2]), 'OV_N'] = i*(1+df21.value[12])
    hru.ix[hru.SUBBASIN.isin([80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]), 'OV_N'] = i*(1+df23.value[12])
    hru.ix[hru.SUBBASIN.isin([85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,49,29,22,24,25,9,10]), 'OV_N'] = i*(1+df56.value[12])
    hru.ix[hru.SUBBASIN.isin([92,88,95,94,93]), 'OV_N'] = i*(1+df58.value[12])
where df21.value[12] is a value from a txt file
The code results in infinite values of OV_N for all subbasins, so I assume the loop passes over the data multiple times, but I can't find the mistake; this code worked before with different numbers of subbasins.
It is generally better not to loop and index over rows in a pandas DataFrame; transforming the DataFrame with column operations is the more idiomatic pandas approach. A DataFrame can be thought of as a zipped combination of pandas Series: each column is its own Series, and all of them share the same index. Operations can be applied to one or more Series to create a new Series with the same index, and a Series can also be combined with a one-dimensional numpy array. It is helpful to understand pandas indexing; however, this answer will just use sequential integer indexing.
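For instance, a quick sketch of a column-wise operation (made-up numbers):
import pandas as pd
s1 = pd.Series([1.0, 2.0, 3.0])
s2 = pd.Series([0.1, 0.2, 0.3])
# Element-wise combination; the result is a new Series with the same index
print(s1 * (1 + s2))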
To modify the value of OV_N for each SUBBASIN number:
Initialize the hru DataFrame by reading it in from the hru.csv as in the original question. Here we initialize it with the data given in the question.
import numpy as np
import pandas as pd
hru = pd.DataFrame({
    'SUBBASIN': [1,1,2,2,3,3,3,3,3],
    'HRU': [1,2,1,2,1,2,3,4,5],
    'HRU_SLP': [0.016155144,0.015563287,0.010589782,0.011574839,0.013865396,0.01744597,0.018983217,0.013890315,0.011792533],
    'OV_N': [0.15,0.14,0.15,0.14,0.15,0.15,0.14,0.05,0.05]})
Create one separate pandas Series that gathers all the values from the various DataFrames (df21, df23, df56, and df58) into one place; it will be used to look up values by index. Call it subbasin_multiplier_ds. Assume the values 21, 23, 56, and 58, respectively, were read from the txt file; replace them with the real values read in from the txt file.
subbasin_multiplier_ds = pd.Series([21]*96)
subbasin_multiplier_ds[[80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,
    34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]] = 23
subbasin_multiplier_ds[[85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,
    86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,
    49,29,22,24,25,9,10]] = 56
subbasin_multiplier_ds[[92,88,95,94,93]] = 58
Replace OV_N in the hru DataFrame with a column operation, looking up each row's multiplier in subbasin_multiplier_ds by index:
hru['OV_N'] = hru['OV_N'] * (1 + subbasin_multiplier_ds[hru['SUBBASIN']].values)
Calling .values above converts the lookup result to a numpy array, so the multiplication pairs values by position and the expected results are achieved. Without it, pandas would align the two Series on their mismatched indexes (hru's row labels versus the SUBBASIN labels); if you want to see what happens, try removing .values.
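To see why .values matters, here is a small sketch of the intermediate lookup; its index consists of the SUBBASIN labels, not hru's row positions:
lookup = subbasin_multiplier_ds[hru['SUBBASIN']]
print(lookup.index.tolist())  # [1, 1, 2, 2, 3, 3, 3, 3, 3], the SUBBASIN labels
# Multiplying two Series aligns on index labels, so without .values the
# result would pair hru's 0..8 row labels against these SUBBASIN labels
# and produce NaN wherever the labels do not match.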

Pandas 0.17.1 error subtracting datetime64[ns] cols if dataframe is empty

I just updated from pandas 0.16.2 to 0.17.1 and am now getting this error when executing existing code that worked fine in 0.16.2
TypeError: ufunc subtract cannot use operands with types dtype('<M8[ns]') and dtype('O')
I'm trying to subtract two datetime64 cols on a subset of a dataframe given certain criteria. The error only seems to happen when the subset is empty. In 0.16.2 the below sample code returns a new column D with NaN values, but in 0.17.1 I get the above error.
Any ideas? I need to get this code working again and was hoping not to have to downgrade back to 0.16.2.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 2],
                   'B': ['2016-02-11 05:15:34', '2016-02-11 04:04:54'],
                   'C': ['2016-02-11 14:08:02', '2016-02-12 01:58:51']})
df['B'] = pd.to_datetime(df['B'])
df['C'] = pd.to_datetime(df['C'])
ix = df['A']==3
df.loc[ix, 'D'] = (df[ix]['C'] - df[ix]['B']) / np.timedelta64(1, 's')
df returned in 0.16.2:
A B C D
0 1 2016-02-11 05:15:34 2016-02-11 14:08:02 NaN
1 2 2016-02-11 04:04:54 2016-02-12 01:58:51 NaN
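One possible workaround (a sketch under the assumption that the empty slice is what trips the subtraction, not a confirmed fix for 0.17.1) is to pre-create column D and only subtract when the mask selects rows:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 2],
                   'B': ['2016-02-11 05:15:34', '2016-02-11 04:04:54'],
                   'C': ['2016-02-11 14:08:02', '2016-02-12 01:58:51']})
df['B'] = pd.to_datetime(df['B'])
df['C'] = pd.to_datetime(df['C'])
ix = df['A'] == 3
df['D'] = np.nan  # an empty mask then still leaves NaN, matching the 0.16.2 output
if ix.any():
    # .loc on both sides keeps the datetime64 dtypes intact for the subtraction
    df.loc[ix, 'D'] = (df.loc[ix, 'C'] - df.loc[ix, 'B']) / np.timedelta64(1, 's')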

AttributeError: 'DataFrame' object has no attribute 'Height'

I am able to convert a csv file to a pandas DataFrame and print out the table, as seen below. However, when I try to print out the Height column I get an error. How can I fix this?
import pandas as pd
df = pd.read_csv('/path../NavieBayes.csv')
print df  # this prints out as seen below
print df.Height  # this raises: AttributeError: 'DataFrame' object has no attribute 'Height'
Height Weight Classifer
0 70.0 180 Adult
1 58.0 109 Adult
2 59.0 111 Adult
3 60.0 113 Adult
4 61.0 115 Adult
I have run into a similar issue before when reading from csv. Assuming it is the same:
col_name = df.columns[0]
df = df.rename(columns={col_name: 'new_name'})
The error in my case was caused (I think) by a byte order mark in the csv, or some other non-printing character added to the first column label. df.columns returns an array of the column names, and df.columns[0] gets the first one. Try printing it and see if something is odd in the result.
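If a byte order mark is indeed the culprit, one thing worth trying (a sketch, reusing the question's path) is to strip it at read time with the utf-8-sig codec:
import pandas as pd
# utf-8-sig transparently drops a leading BOM, so the first column label
# comes through as 'Height' rather than '\ufeffHeight'
df = pd.read_csv('/path../NavieBayes.csv', encoding='utf-8-sig')
print(repr(df.columns[0]))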
PS: On the above answer by JAB: if there are clearly spaces in your column names, use skipinitialspace=True in read_csv, e.g.
df = pd.read_csv('/path../NavieBayes.csv',skipinitialspace=True)
Use a raw string for a Windows path so the backslashes are not treated as escape sequences:
df = pd.read_csv(r'path_of_file\csv_file_name.csv')
or use forward slashes:
df = pd.read_csv('path_of_file/csv_file_name.csv')
Example:
data = pd.read_csv(r'F:\Desktop\datasets\hackathon+data+set.csv')
Try it; it should work.

Pandas Interpolate Returning NaNs

I'm trying to do basic interpolation of position data at 60hz (~16ms) intervals. When I try to use pandas 0.14 interpolation over the dataframe, it tells me I only have NaNs in my data set (not true). When I try to run it over individual series pulled from the dataframe, it returns the same series without the NaNs filled in. I've tried setting the indices to integers, using different methods, fiddling with the axis and limit parameters of the interpolation function - no dice. What am I doing wrong?
df.head(5):
x y ms
0 20.5815 14.1821 333.3333
1 NaN NaN 350
2 20.6112 14.2013 366.6667
3 NaN NaN 383.3333
4 20.5349 14.2232 400
df = df.set_index(df.ms) # set indices to milliseconds
When I try running
df.interpolate(method='values')
I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-462-cb0f1f01eb84> in <module>()
12
13
---> 14 df.interpolate(method='values')
15
16
/Users/jsb/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in interpolate(self, method, axis, limit, inplace, downcast, **kwargs)
2511
2512 if self._data.get_dtype_counts().get('object') == len(self.T):
-> 2513 raise TypeError("Cannot interpolate with all NaNs.")
2514
2515 # create/use the index
TypeError: Cannot interpolate with all NaNs.
I've also tried running it over individual series, which only returns what I put in:
temp = df.x
temp.interpolate(method='values')
333.333333 20.5815
350.000000 NaN
366.666667 20.6112
383.333333 NaN
400.000000 20.5349
Name: x, dtype: object
EDIT :
Props to Jeff for inspiring the solution.
Adding:
df[['x','y','ms']] = df[['x','y','ms']].astype(float)
before
df.interpolate(method='values')
interpolation did the trick.
Based on your edit (with props to Jeff for inspiring the solution), adding:
df = df.astype(float)
before
df.interpolate(method='values')
interpolation did the trick for me as well. Unless you're sub-selecting a column set, you don't need to specify the columns.
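A quick way to confirm the dtype is the problem (a sketch using the question's frame) is to inspect df.dtypes before interpolating; object columns are the giveaway:
print(df.dtypes)           # any column shown as 'object' needs conversion first
df = df.astype(float)      # convert, then interpolation works
df = df.interpolate(method='values')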
I'm not able to reproduce the error (see below for a copy/paste-able example); can you make sure the data you show is actually representative of your data?
In [137]: from StringIO import StringIO
In [138]: df = pd.read_csv(StringIO(""" x y ms
...: 0 20.5815 14.1821 333.3333
...: 1 NaN NaN 350
...: 2 20.6112 14.2013 366.6667
...: 3 NaN NaN 383.3333
...: 4 20.5349 14.2232 400"""), delim_whitespace=True)
In [140]: df = df.set_index(df.ms)
In [142]: df.interpolate(method='values')
Out[142]:
x y ms
ms
333.3333 20.58150 14.18210 333.3333
350.0000 20.59635 14.19170 350.0000
366.6667 20.61120 14.20130 366.6667
383.3333 20.57305 14.21225 383.3333
400.0000 20.53490 14.22320 400.0000