Building new dataframe from series - python-2.7

Taking another crack at an older question of mine since I still do not understand how to properly do what I want.
I have data stored in a dataframe and need to extract averaged chunks of it to use later. My index is datetime values, but this is not terribly important. Unfortunately, I cannot do a simple df.resample() operation, since the data I need to extract is not regularly spaced. Example:
import pandas as pd
from numpy import *
# Build example dataframe
df = pd.DataFrame(data=random.rand(10,3),index=None,columns=list('ABC'))
# Build dummy dataframe to store averaged data from "df"
dummy = pd.DataFrame(columns=df.columns)
# Perform averaging of "df"
for r in xrange(1,10,2):
    ave = df.ix[r-1:r+1].mean()
    # Store averaged data in dummy dataframe
    # Here is where I hit my problem, since ave is a Series
    dummy = dummy.append(ave)
I cannot append a Series to a DataFrame like this.
I can work around it by converting ave to a dictionary, then to a DataFrame, and appending that to dummy:
for r in xrange(1,10,2):
    ave = df.ix[r-1:r+1].mean().to_dict()
    ave = pd.DataFrame(ave,index=[r])
    dummy = dummy.append(ave)
First: does my overall goal make sense?
Second: Is there a better way to achieve this? Converting to a dictionary, then to a DataFrame, then appending seems kludgy, but it is the best I have.
Begin Edit
unutbu raised a good point. As written, rolling_mean() will work. But I am interested in only a very few rows of data; everything else is considered garbage.
# Now creating larger dataframe for illustration
df = pd.DataFrame(data=random.rand(10000,3),index=None,columns=list('ABC'))
# Now, most of the data are not averaged
for r in xrange(1,10000,50):
    ave = df.ix[r-1:r+1].mean().to_dict()
    ave = pd.DataFrame(ave,index=[r])
    dummy = dummy.append(ave)
The main problem I have with my examples is showing the irregularity with which the averaging is done. The averaging is event-driven (i.e. if something happened at 2013-01-01 14:23, then average the data about 2013-01-01 14:23 +/- 2.5 min).
Unfortunately, the data timestamps are also highly irregular, which makes rolling_mean() ineffective in this case. So I have irregular events determining when I should average my irregularly recorded data, making a nice problem.
I can achieve what I want, but only by converting ave from a Series to a dictionary, then to a DataFrame. Perhaps in this case "good enough" should be left alone.
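For concreteness, here is a minimal sketch of the kind of event-driven windowing I am after (the event times and data timestamps below are made up for illustration):
import numpy as np
import pandas as pd
from datetime import timedelta

# Made-up event times; in reality these come from a separate, irregular event log
events = pd.to_datetime(['2013-01-01 14:23:00', '2013-01-01 15:07:00'])

# Irregularly timestamped data
idx = pd.to_datetime(['2013-01-01 14:21:13', '2013-01-01 14:22:40',
                      '2013-01-01 14:24:55', '2013-01-01 15:06:01'])
df = pd.DataFrame(np.random.rand(4, 3), index=idx, columns=list('ABC'))

window = timedelta(minutes=2.5)
rows = []
for t in events:
    # Average every sample that falls within +/- 2.5 min of the event
    rows.append(df[(df.index >= t - window) & (df.index <= t + window)].mean())
dummy = pd.DataFrame(rows, index=events)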
End Edit

It sounds like what you are looking for is pd.rolling_mean:
import pandas as pd
import numpy as np
np.random.seed(1)
# Build example dataframe
df = pd.DataFrame(data=np.random.rand(10,3), index=None, columns=list('ABC'))
print(df)
# A B C
# 0 0.417022 0.720324 0.000114
# 1 0.302333 0.146756 0.092339
# 2 0.186260 0.345561 0.396767
# 3 0.538817 0.419195 0.685220
# 4 0.204452 0.878117 0.027388
# 5 0.670468 0.417305 0.558690
# 6 0.140387 0.198101 0.800745
# 7 0.968262 0.313424 0.692323
# 8 0.876389 0.894607 0.085044
# 9 0.039055 0.169830 0.878143
dummy = pd.rolling_mean(df, window=3).dropna()
print(dummy)
yields
A B C
2 0.301872 0.404214 0.163073
3 0.342470 0.303837 0.391442
4 0.309843 0.547624 0.369792
5 0.471245 0.571539 0.423766
6 0.338436 0.497841 0.462274
7 0.593039 0.309610 0.683919
8 0.661679 0.468711 0.526037
9 0.627902 0.459287 0.551836
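If you only want every other window (to roughly mimic the stride-2 loop in the question), you could slice the result afterwards; a sketch:
# Keep the windows ending at rows 2, 4, 6, 8 (i.e. centred on rows 1, 3, 5, 7)
every_other = dummy.iloc[::2]
print(every_other)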

Here's another way with a datelike index.
In [67]: df = pd.DataFrame(data=np.random.rand(10,3), index=None, columns=list('ABC'))
In [68]: df
Out[68]:
A B C
0 0.417022 0.720324 0.000114
1 0.302333 0.146756 0.092339
2 0.186260 0.345561 0.396767
3 0.538817 0.419195 0.685220
4 0.204452 0.878117 0.027388
5 0.670468 0.417305 0.558690
6 0.140387 0.198101 0.800745
7 0.968262 0.313424 0.692323
8 0.876389 0.894607 0.085044
9 0.039055 0.169830 0.878143
This gives a regular index, but pretend it is irregular in time:
In [69]: df.index = pd.date_range('20130101 09:00:58', periods=10, freq='s')
In [70]: df
Out[70]:
A B C
2013-01-01 09:00:58 0.417022 0.720324 0.000114
2013-01-01 09:00:59 0.302333 0.146756 0.092339
2013-01-01 09:01:00 0.186260 0.345561 0.396767
2013-01-01 09:01:01 0.538817 0.419195 0.685220
2013-01-01 09:01:02 0.204452 0.878117 0.027388
2013-01-01 09:01:03 0.670468 0.417305 0.558690
2013-01-01 09:01:04 0.140387 0.198101 0.800745
2013-01-01 09:01:05 0.968262 0.313424 0.692323
2013-01-01 09:01:06 0.876389 0.894607 0.085044
2013-01-01 09:01:07 0.039055 0.169830 0.878143
Take every 3s of data (whether it's there or not) and take the mean (or you could do something fancier if you want). There are a bunch more options (e.g. which side to include, where to put the labels, etc.); see the resample documentation.
In [71]: df.resample('3s',how=lambda x: x.mean())
Out[71]:
A B C
2013-01-01 09:00:57 0.359677 0.433540 0.046226
2013-01-01 09:01:00 0.309843 0.547624 0.369792
2013-01-01 09:01:03 0.593039 0.309610 0.683919
2013-01-01 09:01:06 0.457722 0.532219 0.481593
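For a plain mean, the lambda is not required; the built-in aggregation name should give the same result (a minimal equivalent):
# Equivalent to the lambda above for a simple mean
df.resample('3s', how='mean')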

Related

Reducing the Sparsity of a One-Hot Encoded dataset

I'm trying to run some feature selection algorithms on the UCI adult data set and I'm running into a problem with univariate feature selection. I'm doing one-hot encoding on all the categorical data to change it to numerical, but that gives me a lot of F scores.
How can I avoid this? What should I do to make this code better?
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Encode
adult['Gender'] = adult['sex'].map({'Female': 0, 'Male': 1}).astype(int)
adult = adult.drop(['sex'], axis=1)
adult['Earnings'] = adult['income'].map({'<=50K': 0, '>50K': 1}).astype(int)
adult = adult.drop(['income'], axis=1)
#OneHot Encode
adult = pd.get_dummies(adult, columns=["race"])
target = adult["Earnings"]
data = adult.drop(["Earnings"], axis=1)
selector = SelectKBest(f_classif, k=5)
selector.fit_transform(data, target)
for n, s in zip(data.head(0), selector.scores_):
    print "F Score ", s, "for feature ", n
EDIT:
Partial results of current code:
F Score 26.1375747945 for feature race_Amer-Indian-Eskimo
F Score 3.91592196913 for feature race_Asian-Pac-Islander
F Score 237.173133254 for feature race_Black
F Score 31.117798305 for feature race_Other
F Score 218.117092671 for feature race_White
Expected Results:
F Score "f_score" for feature "race"
By doing the one-hot encoding, the feature above is split into many sub-features, whereas I would just like to generalize it to just race (see Expected Results), if that is possible.
One way in which you can reduce the number of features, whilst still encoding your categories in a non-ordinal manner, is by using binary encoding. One-hot encoding has a linear growth rate n, where n is the number of categories in a categorical feature. Binary encoding has a log_2(n) growth rate. In other words, doubling the number of categories adds a single column for binary encoding, whereas it doubles the number of columns for one-hot encoding.
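To make the growth rates concrete, here is a quick back-of-the-envelope comparison (the cardinality of 40 is just a made-up example):
import math

# Rough column counts for a single categorical feature with n categories
n = 40                                        # hypothetical cardinality
one_hot_cols = n                              # one-hot: one column per category -> 40
binary_cols = int(math.ceil(math.log(n, 2)))  # binary: ~log2(n) columns -> 6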
Binary encoding can be easily implemented in Python by using the category_encoders package. The package is pip installable and works very seamlessly with sklearn and pandas. Here is an example:
import pandas as pd
import category_encoders as ce
df = pd.DataFrame({'cat1':['A','N','K','P'], 'cat2':['C','S','T','B']})
enc_bin = ce.binary_encoding.BinaryEncoding(cols=['cat1']) # cols=None, all string columns encoded
df_trans = enc_bin.fit_transform(df)
print(df_trans)
Out[1]:
cat1_0 cat1_1 cat2
0 1 1 C
1 0 1 S
2 1 0 T
3 0 0 B
Here is the code from a previous answer of mine using the same variables as above, but with one-hot encoding. Let's compare how the two outputs look.
import pandas as pd
import category_encoders as ce
df = pd.DataFrame({'cat1':['A','N','K','P'], 'cat2':['C','S','T','B']})
enc_ohe = ce.one_hot.OneHotEncoder(cols=['cat1']) # cols=None, all string columns encoded
df_trans = enc_ohe.fit_transform(df)
print(df_trans)
Out[2]:
cat1_0 cat1_1 cat1_2 cat1_3 cat2
0 0 0 1 0 C
1 0 0 0 1 S
2 1 0 0 0 T
3 0 1 0 0 B
See how binary encoding uses half as many columns to uniquely describe each category within the category cat1.

Python, calculating time difference

I'm parsing logs generated from multiple sources and joined together to form a huge log file in the following format:
My_testNumber: 14, JobType = testx.
ABC 2234
**SR 111**
1483529571 1 1 Wed Jan 4 11:32:51 2017 0 4
datatype someRandomValue
SourceCode.Cpp 588
DBConnection failed
TB 132
**SR 284**
1483529572 0 1 Wed Jan 4 11:32:52 2017 5010400 4
datatype someRandomXX
SourceCode2.cpp 455
DBConnection Success
TB 102
**SR 299**
1483529572 0 1 **Wed Jan 4 11:32:54 2017** 5010400 4
datatype someRandomXX
SourceCode3.cpp 455
ConnectionManager Success
....
(there are dozens of SR Numbers here)
Now I'm looking for a smart way to parse the logs so that I can calculate the time differences in seconds for each testNumber and SR number.
For example, for My_testNumber: 14 it subtracts the SR 284 and SR 111 times (the difference would be 1 second here); for SR 284 and SR 299 it is 2 seconds, and so on.
You can parse your posted log file and save the corresponding data accordingly. Then, you can work with the data to get the time differences. The following should be a decent start:
from itertools import combinations
from itertools import permutations # if order matters
from collections import OrderedDict
from datetime import datetime
import re
sr_numbers = []
dates = []
# Loop through the file and get the test number and times
# Save the data in a list
pattern = re.compile(r"(.*)\*{2}(.*)\*{2}(.*)")
for line in open('/Path/to/log/file'):
    if '**' in line:
        # Get the data between the asterisks
        if 'SR' in line:
            sr_numbers.append(re.sub(pattern, "\\2", line.strip()))
        else:
            dates.append(datetime.strptime(re.sub(pattern, "\\2", line.strip()), '%a %b %d %H:%M:%S %Y'))
    else:
        continue
# Use hashmap container (ordered dictionary) to make it easy to get the time differences
# Using OrderedDict here to maintain the order of the SR numbers along the file
log_dict = OrderedDict((k,v) for k,v in zip(sr_numbers, dates))
# Use combinations to get the possible combinations (or permutations if order matters) of time differences
time_differences = {"{} - {}".format(*x):(log_dict[x[1]] - log_dict[x[0]]).seconds for x in combinations(log_dict, 2)}
print(time_differences)
# {'SR 284 - SR 299': 2, 'SR 111 - SR 284': 1, 'SR 111 - SR 299': 3}
Edit:
Parsing the file without relying on the asterisks around the dates:
from itertools import combinations
from itertools import permutations # if order matters
from collections import OrderedDict
from datetime import datetime
import re
sr_numbers = []
dates = []
# Loop through the file and get the test number and times
# Save the data in a list
pattern = re.compile(r"(.*)\*{2}(.*)\*{2}(.*)")
for line in open('/Path/to/log/file'):
    if 'SR' in line:
        current_sr_number = re.sub(pattern, "\\2", line.strip())
        sr_numbers.append(current_sr_number)
    elif line.strip().count(":") > 1:
        try:
            dates.append(datetime.strptime(re.split("\s{3,}", line)[2].strip("*"), '%a %b %d %H:%M:%S %Y'))
        except IndexError:
            # print(re.split("\s{3,}", line))
            dates.append(datetime.strptime(re.split("\t+", line)[2].strip("*"), '%a %b %d %H:%M:%S %Y'))
    else:
        continue
# Use hashmap container (ordered dictionary) to make it easy to get the time differences
# Using OrderedDict here to maintain the order of the SR numbers along the file
log_dict = OrderedDict((k,v) for k,v in zip(sr_numbers, dates))
# Use combinations to get the possible combinations (or permutations if order matters) of time differences
time_differences = {"{} - {}".format(*x):(log_dict[x[1]] - log_dict[x[0]]).seconds for x in combinations(log_dict, 2)}
print(time_differences)
# {'SR 284 - SR 299': 2, 'SR 111 - SR 284': 1, 'SR 111 - SR 299': 3}
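If you only care about consecutive SR numbers (as in the example in the question), a small variation reusing log_dict from above would be:
# Differences between consecutive SR entries only (a sketch; order follows the file)
items = list(log_dict.items())
consecutive = {"{} - {}".format(a[0], b[0]): (b[1] - a[1]).seconds
               for a, b in zip(items, items[1:])}
print(consecutive)
# e.g. {'SR 111 - SR 284': 1, 'SR 284 - SR 299': 2}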
I hope this proves useful.

Looping through file with .ix and .isin

My original data looks like this:
SUBBASIN HRU HRU_SLP OV_N
1 1 0.016155144 0.15
1 2 0.015563287 0.14
2 1 0.010589782 0.15
2 2 0.011574839 0.14
3 1 0.013865396 0.15
3 2 0.01744597 0.15
3 3 0.018983217 0.14
3 4 0.013890315 0.05
3 5 0.011792533 0.05
I need to modify the value of OV_N for each SUBBASIN number:
hru = pd.read_csv('hru.csv')
for i in hru.OV_N:
    hru.ix[hru.SUBBASIN.isin([76,65,64,72,81,84,60,46,37,1,2]), 'OV_N'] = i*(1+df21.value[12])
    hru.ix[hru.SUBBASIN.isin([80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]), 'OV_N'] = i*(1+df23.value[12])
    hru.ix[hru.SUBBASIN.isin([85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,49,29,22,24,25,9,10]), 'OV_N'] = i*(1+df56.value[12])
    hru.ix[hru.SUBBASIN.isin([92,88,95,94,93]), 'OV_N'] = i*(1+df58.value[12])
where df21.value[12] is a value from a txt file
The code results in infinite values of OV_N for all subbasins, so I assume the loop over the file runs multiple times, but I can't find the mistake, and this code was working before with different numbers of subbasins.
It is generally better not to loop and index over rows in a pandas DataFrame. Transforming the DataFrame by column operations is the more idiomatic pandas approach. A pandas DataFrame can be thought of as a zipped combination of pandas Series: each column is its own pandas Series, all sharing the same index. Operations can be applied to one or more pandas Series to create a new Series that shares the same index. Operations can also be applied to combine a Series with a one-dimensional numpy array to create a new Series. It is helpful to understand pandas indexing; however, this answer will just use sequential integer indexing.
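As a toy illustration of that idea (the values are made up), operating on whole Series produces a new Series without any explicit loop:
import pandas as pd

s1 = pd.Series([0.15, 0.14, 0.15])   # e.g. an OV_N-like column
s2 = pd.Series([0.21, 0.23, 0.56])   # e.g. per-row multipliers
s3 = s1 * (1 + s2)                   # element-wise, aligned on the shared index
print(s3)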
To modify the value of OV_N for each SUBBASIN number:
Initialize the hru DataFrame by reading it in from the hru.csv as in the original question. Here we initialize it with the data given in the question.
import numpy as np
import pandas as pd
hru = pd.DataFrame({
'SUBBASIN':[1,1,2,2,3,3,3,3,3],
'HRU':[1,2,1,2,1,2,3,4,5],
'HRU_SLP':[0.016155144,0.015563287,0.010589782,0.011574839,0.013865396,0.01744597,0.018983217,0.013890315,0.011792533],
'OV_N':[0.15,0.14,0.15,0.14,0.15,0.15,0.14,0.05,0.05]})
Create one separate pandas Series that gathers and stores all the values from the various DataFrames, i.e. df21, df23, df56, and df58, into one place. This will be used to look up values by index. Let’s call it subbasin_multiplier_ds. Let’s respectively assume values of 21, 23, 56, and 58 were read from the txt file. Do replace these with the real values read in from the txt file.
subbasin_multiplier_ds = pd.Series([21]*96)
subbasin_multiplier_ds[[80,74,75,66,55,53,57,63,61,41,38,27,26,45,40,
    34,35,31,33,21,20,17,18,19,23,14,13,8,7,11,6,4,3,5,12]] = 23
subbasin_multiplier_ds[[85,58,78,54,59,51,52,30,28,16,15,77,79,71,70,
    86,73,68,69,56,67,62,82,87,83,91,89,90,43,36,39,47,32,49,42,48,50,
    49,29,22,24,25,9,10]] = 56
subbasin_multiplier_ds[[92,88,95,94,93]] = 58
Replace OV_N in hru DataFrame based on columns in the DataFrame and a lookup in subbasin_multiplier_ds by index.
hru['OV_N'] = hru['OV_N'] * (1 + subbasin_multiplier_ds[hru['SUBBASIN']].values)
Calling .values above produces a numpy array, so the multiplier values line up positionally with the rows of hru rather than being realigned by index, and the expected results are achieved. If you want to experiment with removing .values, give it a try and see what happens.
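An equivalent lookup, as an alternative to the assignment above (not in addition to it), is Series.map, which aligns by value and avoids the positional .values trick; a sketch using the same names:
# Alternative to the line above: map each SUBBASIN to its multiplier directly
hru['OV_N'] = hru['OV_N'] * (1 + hru['SUBBASIN'].map(subbasin_multiplier_ds))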

Pandas Interpolate Returning NaNs

I'm trying to do basic interpolation of position data at 60hz (~16ms) intervals. When I try to use pandas 0.14 interpolation over the dataframe, it tells me I only have NaNs in my data set (not true). When I try to run it over individual series pulled from the dataframe, it returns the same series without the NaNs filled in. I've tried setting the indices to integers, using different methods, fiddling with the axis and limit parameters of the interpolation function - no dice. What am I doing wrong?
df.head(5) :
x y ms
0 20.5815 14.1821 333.3333
1 NaN NaN 350
2 20.6112 14.2013 366.6667
3 NaN NaN 383.3333
4 20.5349 14.2232 400
df = df.set_index(df.ms) # set indices to milliseconds
When I try running
df.interpolate(method='values')
I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-462-cb0f1f01eb84> in <module>()
12
13
---> 14 df.interpolate(method='values')
15
16
/Users/jsb/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in interpolate(self, method, axis, limit, inplace, downcast, **kwargs)
2511
2512 if self._data.get_dtype_counts().get('object') == len(self.T):
-> 2513 raise TypeError("Cannot interpolate with all NaNs.")
2514
2515 # create/use the index
TypeError: Cannot interpolate with all NaNs.
I've also tried running over individual series, which only return what I put in:
temp = df.x
temp.interpolate(method='values')
333.333333 20.5815
350.000000 NaN
366.666667 20.6112
383.333333 NaN
400.000000 20.5349
Name: x, dtype: object
EDIT :
Props to Jeff for inspiring the solution.
Adding:
df[['x','y','ms']] = df[['x','y','ms']].astype(float)
before
df.interpolate(method='values')
interpolation did the trick.
Based on your edit (with props to Jeff for inspiring the solution): adding
df = df.astype(float)
before
df.interpolate(method='values')
interpolation did the trick for me as well. Unless you're sub-selecting a column set, you don't need to specify the columns.
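A quick way to confirm that this dtype issue is what is biting you (a sketch; column names as in the question):
# If interpolate() complains about all NaNs, the columns likely came in as object dtype
print(df.dtypes)          # likely shows 'object' for x, y, ms before the cast

df = df.astype(float)     # cast everything to float
print(df.interpolate(method='values').head())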
I'm not able to reproduce the error (see below for a copy/paste-able example); can you make sure the data you show is actually representative of your data?
In [137]: from StringIO import StringIO
In [138]: df = pd.read_csv(StringIO(""" x y ms
...: 0 20.5815 14.1821 333.3333
...: 1 NaN NaN 350
...: 2 20.6112 14.2013 366.6667
...: 3 NaN NaN 383.3333
...: 4 20.5349 14.2232 400"""), delim_whitespace=True)
In [140]: df = df.set_index(df.ms)
In [142]: df.interpolate(method='values')
Out[142]:
x y ms
ms
333.3333 20.58150 14.18210 333.3333
350.0000 20.59635 14.19170 350.0000
366.6667 20.61120 14.20130 366.6667
383.3333 20.57305 14.21225 383.3333
400.0000 20.53490 14.22320 400.0000

Appending Tuples to Pandas DataFrame

I am trying to join (vertically) some of the tuples; better said, I am inserting these tuples into the dataframe, but I have been unable to do so until now. The problem arises because I am adding them horizontally and not vertically.
data_frame = pandas.DataFrame(columns=("A","B","C","D"))
str1 = "Doodles are the logo-incorporating works of art that Google regularly features on its homepage. They began in 1998 with a stick figure by Google co-founders Larry Page and Sergey Brin -- to indicate they were attending the Burning Man festival. Since then the doodles have become works of art -- some of them high-tech and complex -- created by a team of doodlers. Stay tuned here for more of this year's doodles"
aa = str1.split()
bb = zip(aa[0:4])
data_frame.append(bb,ignore_index=True,verify_integrity=False)
Is this possible, or do I have to iterate over each word in the tuple and use insert?
You could do this
In [8]: index=list('ABCD')
In [9]: df = pd.DataFrame(columns=index)
In [11]: df.append(pd.Series(aa[0:4], index=index), ignore_index=True)
Out[11]:
A B C D
0 Doodles are the logo-incorporating
Alternatively, if you have many of these rows to append, just create a list, then build with DataFrame(list_of_series) at the end:
In [13]: pd.DataFrame([aa[0:4], aa[5:8]], columns=list('ABCD'))
Out[13]:
A B C D
0 Doodles are the logo-incorporating
1 of art that None
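To spell out the "collect the rows, build once" idea with Series rows, here is a small sketch reusing aa and index from above:
# Build a list of Series, one per future row, then construct the frame in one go
rows = [pd.Series(aa[i:i+4], index=index) for i in range(0, 12, 4)]
df_all = pd.DataFrame(rows)
print(df_all)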