I have already sorted the data, and the dataframe now looks like this:
              Tr Srate(V/ns)mean  Tf Srate(V/ns)mean
CPULabel
100HiBW_Fast                3.16                3.09
100LoBW_Fast                3.16                3.09
BP100_Fast                  3.16                3.06
My dataframe is slew_rate_max. I tried to use:
slew_rate_max.max()
I expected the result to be 3.16. However, I got the max values of both columns individually.
Tr Srate(V/ns)mean 3.16
Tf Srate(V/ns)mean 3.09
dtype: float64
How can I get the max value of the whole dataframe rather than the max of each column?
Try:
slew_rate_max.max().max()
slew_rate_max.max() returns a Series with the max of each column; taking the max of that Series then gives you the max of the entire dataframe.
or
slew_rate_max.values.max()
slew_rate_max.values converts the dataframe to an np.ndarray. numpy.ndarray.max takes an axis argument whose default is None, so it returns the max of the entire ndarray.
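For reference, here is a minimal sketch of both approaches on a frame rebuilt from the question's numbers. The variable names per_column, overall and overall_np are only illustrative, and to_numpy() is used as the currently recommended spelling of .values.

```python
import pandas as pd

# Rebuild the frame from the question.
slew_rate_max = pd.DataFrame(
    {
        "Tr Srate(V/ns)mean": [3.16, 3.16, 3.16],
        "Tf Srate(V/ns)mean": [3.09, 3.09, 3.06],
    },
    index=pd.Index(["100HiBW_Fast", "100LoBW_Fast", "BP100_Fast"], name="CPULabel"),
)

per_column = slew_rate_max.max()             # Series: one max per column
overall = per_column.max()                   # scalar: max of the column maxima
overall_np = slew_rate_max.to_numpy().max()  # same scalar via the underlying ndarray

print(overall, overall_np)
```

Both give the same number here. The NumPy route assumes every column is numeric, since a mixed-dtype frame becomes an object array.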
I need to build a DataFrame with a very specific structure. Yield curve values as the data, a single date as the index, and days to maturity as the column names.
In[1]: yield_data # list of size 38, with yield values
Out[1]:
[0.096651956137087325,
0.0927199778042056,
0.090000225505577847,
0.088300016028163508,...
In[2]: maturity_data # list of size 38, with days until maturity
Out[2]:
[6,
29,
49,
70,...
In[3]: today
Out[3]:
Timestamp('2017-07-24 00:00:00')
Then I try to create the DataFrame
pd.DataFrame(data=yield_data, index=[today], columns=maturity_data)
but it returns the error
ValueError: Shape of passed values is (1, 38), indices imply (38, 1)
I tried transposing these lists, but plain lists cannot be transposed.
How can I create this DataFrame?
IIUC, you want a dataframe with a single row, so you need to wrap your input data list in another list to make it a list of lists.
yield_data = [0.09,0.092, 0.091]
maturity_data = [6,10,15]
today = pd.to_datetime('2017-07-25')
pd.DataFrame(data=[yield_data],index=[today],columns=maturity_data)
Output:
6 10 15
2017-07-25 0.09 0.092 0.091
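The shape mismatch can also be resolved on the NumPy side. The sketch below uses the same made-up values as the answer and reshapes the flat list into a single row before building the frame. The names row and curve are illustrative only.

```python
import numpy as np
import pandas as pd

yield_data = [0.09, 0.092, 0.091]   # made-up values, as in the answer above
maturity_data = [6, 10, 15]
today = pd.to_datetime("2017-07-25")

# A flat list is read as one column of 3 rows; reshape(1, -1) turns it
# into one row of 3 columns, which matches index=[today].
row = np.asarray(yield_data).reshape(1, -1)
curve = pd.DataFrame(row, index=[today], columns=maturity_data)

print(curve.shape)
```

Wrapping the list as [yield_data] and reshaping to (1, -1) should be equivalent for this purpose; which one reads better is a matter of taste.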
I have a pandas dataframe with timestamps as index:
I would like to convert it into a dataframe with one row per day, but without resampling the original dataframe (that is, without summing or averaging the hourly data). Ideally I would like to get the 24 values of each day in a vector, for example:
Is there a method to do this quickly?
Thanks!
IIUC you can groupby on the date attribute of your index and then apply a lambda that aggregates the values into a list:
In [21]:
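# generate some data (imports added so the snippet runs on its own)
import datetime as dt
import numpy as np
import pandas as pd
df = pd.DataFrame({'GFS_rad':np.random.randn(100), 'GFS_tmp':np.random.randn(100)}, index=pd.date_range(dt.datetime(2016,1,1), freq='1h', periods=100))
df.groupby(df.index.date)[['GFS_rad','GFS_tmp']].agg(lambda x: x.tolist())  # columns selected with a list; one list of values per day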
Out[21]:
GFS_rad \
2016-01-01 [-0.324115177542, 1.59297335764, 0.58118555943...
2016-01-02 [-0.0547016526463, -1.10093451797, -1.55790161...
2016-01-03 [-0.34751220092, 1.06246918632, 0.181218794826...
2016-01-04 [0.950977469848, 0.422905080529, 1.98339145764...
2016-01-05 [-0.405124861624, 0.141470757613, -0.191169333...
GFS_tmp
2016-01-01 [-2.36889710412, -0.557972678049, -1.293544410...
2016-01-02 [-0.125562429825, -0.018852674365, -0.96735945...
2016-01-03 [0.802961514703, -1.68049099535, -0.5116769061...
2016-01-04 [1.35789157665, 1.37583167965, 0.538638510171,...
2016-01-05 [-0.297611872638, 1.10546853812, -0.8726761667...
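As a smaller, deterministic illustration of the same idea, the sketch below uses made-up data covering exactly two days, so that each daily list has 24 entries. The values are just 0..47 and the name daily is illustrative.

```python
import numpy as np
import pandas as pd

# Made-up hourly data covering two full days (values 0.0 .. 47.0).
idx = pd.date_range("2016-01-01", periods=48, freq="h")
df = pd.DataFrame({"GFS_rad": np.arange(48.0)}, index=idx)

# Group on the calendar date of the index and collect each day's
# values into a plain Python list (no summing or averaging).
daily = df.groupby(df.index.date)["GFS_rad"].agg(list)

print(len(daily), len(daily.iloc[0]))
```

Each cell then holds that day's values in their original order; np.array(daily.iloc[0]) would turn one of them back into a vector.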
So I have a dataframe, call it TABLE and I'm using Pandas with Python 2.7 to analyze it. It's mostly categorical data so right now my goal is to have a summary of my table where I list each column name and the average length of the values in that column.
Example table:
A B C E F
0 djsdd 973 348f NaN abcd
1 dsa 49 34h5 NaN NaN
Then my desired output would be something like:
Column AvgLength
A 4.0
B 2.5
C 4.0
E NaN
F 4.0
Now the first problem I had was that there are some numerical values in the dataset. I thought I could resolve that by using .astype(str) so I did the following:
for k in TABLE:
print "%s\t %s"%(k,TABLE[k].astype(str).str.len().mean())
The issue now is that it looks to me like .astype(str) is converting the null values to strings because I ended up with the following output:
Column AvgLength
A 4.0
B 2.5
C 4.0
E 3.0
F 3.5
Notice that column E containing the null values is giving me an average length of 3, and column F is giving me an average of 3.5. My understanding is this happened because it's taking the length of the string "NaN."
Is there some way to do what I want and ignore the Null values? Or is there a completely different approach I should be taking (I'm very new to pandas)?
(I did read about .dropna() but I don't want to omit all columns that might contain null values because some columns may have null values alongside data. I want to just ignore the null values from my mean).
stack to get series
dropna to get rid of NaN
astype(str).str.len() to get lengths
unstack().mean() for average length
reindex(TABLE.columns) to ensure we get all original columns represented
TABLE.stack().dropna().astype(str).str.len().unstack().mean().reindex(TABLE.columns)
A 4.0
B 2.5
C 4.0
E NaN
F 4.0
dtype: float64
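An alternative that avoids stack/unstack is to handle one column at a time. This is only a sketch: avg_len is a hypothetical helper, not a pandas function, and the table is rebuilt from the example in the question.

```python
import numpy as np
import pandas as pd

TABLE = pd.DataFrame(
    {
        "A": ["djsdd", "dsa"],
        "B": [973, 49],
        "C": ["348f", "34h5"],
        "E": [np.nan, np.nan],
        "F": ["abcd", np.nan],
    }
)

def avg_len(col):
    """Mean string length of the non-null values in one column (hypothetical helper)."""
    lengths = [len(str(v)) for v in col.dropna()]
    # An all-null column has nothing to average, so report NaN for it.
    return sum(lengths) / len(lengths) if lengths else float("nan")

summary = TABLE.apply(avg_len)  # one value per column
print(summary)
```

One caveat applies to both approaches: a numeric column that contains NaN is stored as float, so 973 would be measured as the string "973.0".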
I'm having trouble with calculating the mean of Timestamps.
I have a few values with Timestamps in my DataFrame, and I want to aggregate them into a single row per ID holding the sum of the values and the weighted mean of the corresponding Timestamps.
My input is:
Timestamp Value
ID
0 2013-02-03 13:39:00 79
0 2013-02-03 14:03:00 19
1 2013-02-04 11:36:00 2
2 2013-02-04 12:07:00 2
3 2013-02-04 14:04:00 1
And I want to aggregate the data using the ID index.
I was able to sum the Values using
manp_func = {'Value':['sum'] }
new_table = table.groupby(level='ID').agg(manp_func)
But how can I find the weighted mean of the Timestamps, using the values as weights?
Thanks
S.A
agg = lambda x: (x['Timestamp'].astype('i8') * (x['Value'].astype('f8') / x['Value'].sum())).sum()
new_table = table.groupby(level='ID').apply(agg).astype('i8').astype('datetime64[ns]')
Output of new_table
ID
0 2013-02-03 13:43:39.183673344
2 2013-02-04 11:51:30.000000000
3 2013-02-04 14:04:00.000000000
dtype: datetime64[ns]
The main idea is to compute the weighted average as normal, but there are a couple of subtleties:
You have to convert the datetime64[ns] values to an integer offset first, because multiplication is not defined between datetimes and numbers. Then you have to convert the result back.
Calculating the weighted average as sum(a*w)/sum(w) will result in overflow (a*w is too large to be represented as an 8-byte integer), so it has to be calculated as sum(a*(w/sum(w))).
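A different way to sidestep the overflow, shown here only as a sketch and not as the method used above, is to average offsets from each group's earliest timestamp instead of raw epoch values. weighted_mean_time is a hypothetical helper, and the table is rebuilt from the question's input.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(
            ["2013-02-03 13:39:00", "2013-02-03 14:03:00", "2013-02-04 11:36:00",
             "2013-02-04 12:07:00", "2013-02-04 14:04:00"]
        ),
        "Value": [79, 19, 2, 2, 1],
    },
    index=pd.Index([0, 0, 1, 2, 3], name="ID"),
)

def weighted_mean_time(group):
    """Weighted mean of group['Timestamp'], weighted by group['Value'] (hypothetical helper)."""
    t0 = group["Timestamp"].min()
    # Offsets from t0 in seconds are small floats, so offset * weight
    # stays far away from the int64 limit.
    offsets = (group["Timestamp"] - t0).dt.total_seconds().to_numpy()
    mean_offset = np.average(offsets, weights=group["Value"].to_numpy())
    return t0 + pd.to_timedelta(mean_offset, unit="s")

new_table = pd.Series(
    {key: weighted_mean_time(g) for key, g in table.groupby(level="ID")}
)
print(new_table)
```

This assumes the weights in each group do not sum to zero. The result is computed in floating point, so it can differ from an exact calculation in the last few nanoseconds.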
Preparing a sample dataframe:
# Initiate dataframe
date_var = "date"
df = pd.DataFrame(data=[['A', '2018-08-05 17:06:01'],
['A', '2018-08-05 17:06:02'],
['A', '2018-08-05 17:06:03'],
['B', '2018-08-05 17:06:07'],
['B', '2018-08-05 17:06:09'],
['B', '2018-08-05 17:06:11']],
columns=['column', date_var])
# Convert date-column to proper pandas Datetime-values/pd.Timestamps
df[date_var] = pd.to_datetime(df[date_var])
Extraction of the desired average Timestamp-value:
# Extract the numeric value associated to each timestamp (epoch time)
# NOTE: this is done by accessing the .value attribute of each Timestamp in the column
In:
[tsp.value for tsp in df[date_var]]
Out:
[
1533488761000000000, 1533488762000000000, 1533488763000000000,
1533488767000000000, 1533488769000000000, 1533488771000000000
]
# Use this to calculate the mean, then convert the result back to a timestamp
In:
pd.Timestamp(np.nanmean([tsp.value for tsp in df[date_var]]))
Out:
Timestamp('2018-08-05 17:06:05.500000')
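In reasonably recent pandas versions, a datetime column can also be averaged directly with Series.mean(), which may be simpler than going through .value. The sketch below assumes such a version and rebuilds the sample frame from above; overall_mean is an illustrative name.

```python
import pandas as pd

date_var = "date"
df = pd.DataFrame(
    data=[["A", "2018-08-05 17:06:01"],
          ["A", "2018-08-05 17:06:02"],
          ["A", "2018-08-05 17:06:03"],
          ["B", "2018-08-05 17:06:07"],
          ["B", "2018-08-05 17:06:09"],
          ["B", "2018-08-05 17:06:11"]],
    columns=["column", date_var],
)
df[date_var] = pd.to_datetime(df[date_var])

# Mean of the datetime column itself; missing values (NaT) are skipped.
overall_mean = df[date_var].mean()
print(overall_mean)
```

The result should agree with the epoch-based calculation up to floating-point rounding.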
I have a time series similar to:
ts = pd.Series(np.random.randn(60),index=pd.date_range('1/1/2000',periods=60, freq='2h'))
Is there an easy way to make it so that the row index is dates and the column index is the hour?
Basically I am trying to convert from a time-series into a dataframe.
There's always a slicker way to do things than the way I reach for, but I'd make a flat frame first and then pivot. Something like
>>> ts = pd.Series(np.random.randn(10000),index=pd.date_range('1/1/2000',periods=10000, freq='10min'))
>>> df = pd.DataFrame({"date": ts.index.date, "time": ts.index.time, "data": ts.values})
>>> df = df.pivot(index="date", columns="time", values="data")
This produces too large a frame to paste, but looking at the top-left corner:
>>> df.iloc[:5, :5]
time 00:00:00 00:10:00 00:20:00 00:30:00 00:40:00
date
2000-01-01 -0.180811 0.672184 0.098536 -0.687126 -0.206245
2000-01-02 0.746777 0.630105 0.843879 -0.253666 1.337123
2000-01-03 1.325679 0.046904 0.291343 -0.467489 -0.531110
2000-01-04 -0.189141 -1.346146 1.378533 0.887792 2.957479
2000-01-05 -0.232299 -0.853726 -0.078214 -0.158410 0.782468
[5 rows x 5 columns]
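Applied to the 2-hourly series from the question, the same flat-frame-then-pivot idea might look like the sketch below. It uses the hour of day as the column key, since the question asks for hours as columns. The names flat and by_hour are illustrative.

```python
import numpy as np
import pandas as pd

ts = pd.Series(
    np.random.randn(60),
    index=pd.date_range("1/1/2000", periods=60, freq="2h"),
)

# Flat frame with one row per observation, then pivot to date x hour.
flat = pd.DataFrame(
    {"date": ts.index.date, "hour": ts.index.hour, "data": ts.to_numpy()}
)
by_hour = flat.pivot(index="date", columns="hour", values="data")

print(by_hour.shape)
```

pivot requires each (date, hour) pair to be unique, which holds here because there is at most one sample per hour.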