DataFrame from a 3D list

This is what I have (a 3D list):
last_price price2 price3
0 0.00 0.0 0.0
1 870.95 7650.0 2371500.0
2 870.95 7650.0 2371500.0
3 870.95 7650.0 2371500.0
4 877.30 7650.0 2371500.0
5 879.20 6800.0 2381700.0]
I want to create a DataFrame exactly like the list above. How do I do so? I tried pd.DataFrame(the_list), but it gave me this error: ValueError: Must pass 2-d input. shape=(190, 6, 3). Thank you very much.
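One possible approach, sketched under assumptions: the shape (190, 6, 3) suggests the list holds 190 blocks of 6 rows by 3 columns, and pd.DataFrame only accepts 2D input. The sketch below builds one frame per block and stacks them with pd.concat. The sample values, the "block" and "row" level names, and the tiny 2-block list are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for the_list: 2 blocks of 3 rows x 3 columns
# (the real list in the question has shape (190, 6, 3)).
the_list = [
    [[0.0, 0.0, 0.0], [870.95, 7650.0, 2371500.0], [877.30, 7650.0, 2371500.0]],
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
]
columns = ["last_price", "price2", "price3"]

# Each block is 2D, so it can be turned into a DataFrame on its own.
frames = [pd.DataFrame(block, columns=columns) for block in the_list]

# Stack the blocks; the outer index level records which block a row came from.
stacked = pd.concat(frames, keys=list(range(len(frames))), names=["block", "row"])
print(stacked)
```

If only a single block is needed, frames[0] is already a plain 2D DataFrame.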


Getting index values from pd mean() and std() functions

I'm trying to get the index values from the result of a pandas std() call.
My final objective is to match the index with another df and insert the corresponding values (standard deviations).
(in): df_std['index'] = df_std.index
(out): Index([u'AAPL US Equity', u'QQQ US Equity', u'BRABCBACNPR4 BZ Equity'...dtype='object')
However, I've been unable to add the index values as a column to the "right" of df_std because of the types: df_std.index is a series while df_std is a df. When I try to do it, a row is added instead of a column:
(in): df_std['index'] = df_std.index
BRSTNCLF1R25 Govt 64.0864
BRITUBACNPR1 BZ Equity 2.67762
BRSTNCNTB4O9 Govt 48.2419
BRSTNCLF1R74 Govt 64.901
PBR US Equity 0.770755
BRBBASACNOR3 BZ Equity 2.93335
BRSTNCLF1R82 Govt 65.0979
index Index([u'AAPL US Equity', u'QQQ US Equity', u'...
dtype: object
I've already tried converting df_std.index to a tuple and to a dataframe.
I'm trying to match df_std['index'] with df_final['bloomberg_ticker'] and bring the std values to df_final['std']:
(in): print df_final
serie tipo tp_cnpjfundo valor id bloomberg_ticker \
0 NaN caixa NaN NaN 0 NaN
1 NaN titpublicos NaN NaN 1 BRSTNCLF1R17 Govt
2 NaN titpublicos NaN NaN 2 BRSTNCLF1R17 Govt
3 NaN titpublicos NaN NaN 3 BRSTNCLF1R25 Govt
(the column 'id' will be deleted later)
Use .reset_index() rather than assigning, if what you have is a dataframe, i.e.
df_std = df_std.reset_index()
Example :
df = pd.DataFrame([0,1,2,3], index=['a','b','c','d'])
df = df.reset_index()
Output :
index 0
0 a 0
1 b 1
2 c 2
3 d 3
In case what you have is a series, convert it to a dataframe and then call reset_index, i.e. if df_std is the series you have, then
df_std = df_std.to_frame().reset_index()
I think what you are trying to do is map the values of a series to a specific column, so you can use
df = pd.DataFrame({'col':['a','b','c','d','e'],'vales':[5,1,2,4,5]})
s = pd.Series([1,2,3],index=['a','b','c'])
df['new'] = df['col'].map(s)
Output :
col vales new
0 a 5 1.0
1 b 1 2.0
2 c 2 3.0
3 d 4 NaN
4 e 5 NaN
In your case you can use df_final['std'] = df_final['bloomberg_ticker'].map(df_std)
To check whether the index values of the series are present in a column of the dataframe, you can use .isin, i.e.
df['col'].isin(s.index) # Returns the boolean mask
df[df['col'].isin(s.index)] # Returns the rows whose 'col' value is in the series index
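Applied to the question's setup, the map approach might look like the sketch below. The price values are invented for illustration, and df_std is assumed to be the Series returned by std() on a frame whose columns are tickers; only the ticker strings and column names come from the question.

```python
import pandas as pd

# Hypothetical price data; columns are tickers, as the question's output suggests.
prices = pd.DataFrame({
    "AAPL US Equity": [1.0, 2.0, 3.0],
    "PBR US Equity": [2.0, 2.0, 2.0],
})
df_std = prices.std()  # Series indexed by ticker

df_final = pd.DataFrame({
    "bloomberg_ticker": [None, "PBR US Equity", "AAPL US Equity", "BRSTNCLF1R17 Govt"],
})

# Look up each ticker in the Series index; tickers without a match become NaN.
df_final["std"] = df_final["bloomberg_ticker"].map(df_std)
print(df_final)
```

Rows whose ticker is missing or absent from df_std end up with NaN in the new column, so no separate membership check is needed before mapping.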

Slicing a pandas column based on the position of a matching substring

I am trying to slice a pandas column called PATH from a DataFrame called dframe such that I would get the ad1 container's filename with the extension in a new column called AD1position.
0 \
1 \abc.ad1\xaxaxa
2 \defghij.ad1\wbcbcb
3 \tuvwxyz.ad1\ydeded
In other words, here's what I want to see:
PATH AD1position
0 \
1 \abc.ad1\xaxaxa abc.ad1
2 \defghij.ad1\wbcbcb defghij.ad1
3 \tuvwxyz.ad1\ydeded tuvwxyz.ad1
If I was to do this in Excel, I would write:
=if(iserror(search(".ad1",[PATH])),"",mid([PATH],2,search(".ad1",[PATH]) + 3))
In Python, I seem to be stuck. Here's what I wrote thus far:
dframe['AD1position'] = dframe['PATH'].apply(lambda x: x['PATH'].str[1:(x['PATH'].str.find('.ad1')) \
+ 3] if x['PATH'].str.find('.ad1') != -1 else "")
Doing this returns the following error:
TypeError: string indices must be integers
I suspect that the problem is caused by the function in the slicer, but I'd appreciate any help with figuring out how to resolve this.
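The TypeError arises because Series.apply passes each cell (a plain Python string) to the lambda, so x['PATH'] tries to index a string with a string. A minimal sketch of an apply-based fix using ordinary string methods follows. The helper name ad1_name is hypothetical, and the sketch assumes, as in the sample data, that the .ad1 container is the first path component.

```python
import pandas as pd

dframe = pd.DataFrame({
    "PATH": ["\\", "\\abc.ad1\\xaxaxa", "\\defghij.ad1\\wbcbcb", "\\tuvwxyz.ad1\\ydeded"],
})

def ad1_name(path):
    """Return the text between the leading backslash and the end of '.ad1'."""
    pos = path.find(".ad1")
    # Slice ends are exclusive in Python, so add len('.ad1') == 4 rather than 3.
    return path[1:pos + 4] if pos != -1 else ""

# apply on a Series hands each string to the function, one at a time.
dframe["AD1position"] = dframe["PATH"].apply(ad1_name)
print(dframe)
```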
Use the .str.extract() function:
In [17]: df['AD1position'] = df.PATH.str.extract(r'.*?([^\\]*\.ad1)', expand=True)
In [18]: df
PATH AD1position
0 \ NaN
1 \aaa\bbb NaN
2 \byz.ad1 byz.ad1
3 \abc.ad1\xaxaxa abc.ad1
4 \defghij.ad1\wbcbcb defghij.ad1
5 \tuvwxyz.ad1\ydeded tuvwxyz.ad1
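This will get you the first non-empty element of the split (index 1, because every path starts with a backslash, so element 0 is an empty string). Note that it does not check for the .ad1 extension.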
df['AD1position'] = df.PATH.str.split('\\').str.get(1)
Thank you Root.

Not calculating sum for all columns in pandas dataframe

I'm pulling data from Impala using impyla and converting it to a DataFrame using as_pandas. I'm using pandas 0.18.0 and Python 2.7.9.
I'm trying to calculate the sum of each column in a dataframe and select the columns whose sum is greater than a threshold. =[:, > 15]
But when I run this, I get the error below:
pandas.core.indexing.IndexingError: Unalignable boolean Series key
Then I tried the following:
print 'length : ',len( = 0)),' all columns : ',len(
Then I get different lengths, i.e.
length : 78 all columns : 83
And I get the warning below:
C:\Python27\lib\ RuntimeWarning: tp_compare didn't return -1 or -2 for exception
To achieve my goal, I tried another way:
for column in
sum =[column].sum()
if( sum < 15 ): =,1)
Now I get other errors, as below:
TypeError: unsupported operand type(s) for +: 'Decimal' and 'float'
C:\Python27\lib\ RuntimeWarning: tp_compare didn't return -1 or -2 for exception
Then I tried to get the data type of each column, as below.
print 'dtypes : ',
The result has all the columns are one of these int64 , object and float 64
Then i thought of changing the data type of columns which are in object like below
Still i'm getting the same errors, Please help me in solving this.
Note : In all the columns I do not have strings i.e characters and missing values or empty.I have checked this using
As i'm new to pandas and python Please don't mind if it is a silly question. I just want to learn
Please review the simple code below; it should help you understand the reason for the error.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random([3,3]))
df.iloc[0,0] = np.nan
print df
print df.sum(axis=0) > 1.5
print df.loc[:, df.sum(axis=0) > 1.5]
df.iloc[0,0] = 'string'
print df
print df.sum(axis=0) > 1.5
print df.loc[:, df.sum(axis=0) > 1.5]
0 1 2
0 NaN 0.336250 0.801349
1 0.930947 0.803907 0.139484
2 0.826946 0.229269 0.367627
0 True
1 False
2 False
dtype: bool
0 NaN
1 0.930947
2 0.826946
0 1 2
0 string 0.336250 0.801349
1 0.930947 0.803907 0.139484
2 0.826946 0.229269 0.367627
1 False
2 False
dtype: bool
Traceback (most recent call last):
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
In short, your data needs additional preprocessing.
If the offending values are strings that can be converted to numbers, you can convert them with df.astype(); otherwise, you should remove them.
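The "'Decimal' and 'float'" TypeError in the question suggests that the object columns hold decimal.Decimal values rather than strings; that is an assumption, since the actual data is not shown. Under that assumption, a sketch of the preprocessing step with astype could look like this (the column names and values are made up):

```python
from decimal import Decimal

import pandas as pd

# Hypothetical frame: column "a" holds Decimal objects, so its dtype is object.
df = pd.DataFrame({
    "a": [Decimal("10.5"), Decimal("7.25")],
    "b": [1.0, 2.0],
    "c": [20, 30],
})

# Convert only the object columns to float so every column sums numerically.
object_cols = df.select_dtypes(include=["object"]).columns
df = df.astype({col: float for col in object_cols})

# With purely numeric columns, the boolean mask has one entry per column
# and aligns with the frame's columns.
totals = df.sum(axis=0)
selected = df.loc[:, totals > 15]
print(selected)
```

If a column contains values that cannot be converted, astype(float) raises an error, which at least points to the column that needs cleaning.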

Convert 2D numpy.ndarray to pandas.DataFrame

I have a pretty big numpy.ndarray. It's basically an array of arrays. I want to convert it to a pandas.DataFrame. What I want to do is shown in the code below:
from pandas import DataFrame
cache1 = DataFrame([{'id1': 'ABC1234'}, {'id1': 'NCMN7838'}])
cache2 = DataFrame([{'id2': 3276827}, {'id2': 98567498}, {'id2': 38472837}])
ndarr = [[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]]
arr = []
for idx, i in enumerate(ndarr):
    id1 = cache1.loc[idx].id1  # .loc used here because .ix was removed from pandas
    for idx2, val in enumerate(i):
        id2 = cache2.loc[idx2].id2
        if val > 0:
            arr.append(dict(id1=id1, id2=id2, value=val))
df = DataFrame(arr)
I am mapping the index of the outer array and the inner array to index of two DataFrames to get certain IDs.
cache1 and cache2 are pandas.DataFrame. Each has ~100k rows.
This takes really really long, like a few hours to complete.
Is there some way I can speed it up?
I suspect your ndarr, if expressed as a 2D np.array, always has the shape (n, m), where n is the length of cache1.id1 and m is the length of cache2.id2. Also, the last entry in cache2 should be {'id2': 38472837} instead of {'id': 38472837}. If so, the following simple solution may be all that is needed:
In [30]:
index=pd.MultiIndex.from_product([cache1.id1.values, cache2.id2.values],names=['idx1', 'idx2']),
In [33]:
print df.reset_index()
idx1 idx2 val
0 ABC1234 3276827 4.3
1 ABC1234 98567498 5.6
2 ABC1234 38472837 6.7
3 NCMN7838 3276827 3.2
4 NCMN7838 98567498 4.5
5 NCMN7838 38472837 2.1
[6 rows x 3 columns]
Actually, I also think that keeping the MultiIndex may be a better idea.
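For reference, a self-contained sketch of the MultiIndex idea is given below. The DataFrame construction line is a guess at how the pieces fit together, not the answer's exact code; the data, the "val" column, and the "idx1"/"idx2" level names follow the question and the output shown above.

```python
import numpy as np
import pandas as pd

cache1 = pd.DataFrame([{"id1": "ABC1234"}, {"id1": "NCMN7838"}])
cache2 = pd.DataFrame([{"id2": 3276827}, {"id2": 98567498}, {"id2": 38472837}])
ndarr = np.array([[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]])

# Cartesian product of the two id columns: id1 varies slowest, id2 fastest,
# which matches the row-major order produced by ndarr.ravel().
index = pd.MultiIndex.from_product(
    [cache1["id1"].values, cache2["id2"].values], names=["idx1", "idx2"]
)
df = pd.DataFrame({"val": ndarr.ravel()}, index=index)

# Same filter as the `if val > 0` check in the question's loop.
df = df[df["val"] > 0]

flat = df.reset_index()
print(flat)
```

This avoids the Python-level double loop entirely, which is where the hours of run time in the question most likely come from.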
Something like this should work:
ndarr = np.asarray(ndarr) # if ndarr is actually an array, skip this
fast_df = pd.DataFrame({"value": ndarr.ravel()})
i1, i2 = [i.ravel() for i in np.indices(ndarr.shape)]
fast_df["id1"] = cache1["id1"].loc[i1].values
fast_df["id2"] = cache2["id2"].loc[i2].values
which gives
>>> fast_df
value id1 id2
0 4.3 ABC1234 3276827
1 5.6 ABC1234 98567498
2 6.7 ABC1234 NaN
3 3.2 NCMN7838 3276827
4 4.5 NCMN7838 98567498
5 2.1 NCMN7838 NaN
And then if you really want to drop the zero values, you can keep only the nonzero ones using fast_df = fast_df[fast_df['value'] != 0].

Fetching top n records in pandas pivot , based on multiple criteria and plotting them with matplotlib

Use case: extending the pivot functionality of pandas. Fetch the top n records and plot each name's "Click %" against the number of records of that name.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'name':['A', 'A', 'B', 'B','C','A'], 'click':[1,1,0,1,1,0]})
click name
0 1 A
1 1 A
2 0 B
3 1 B
4 1 C
5 0 A
[6 rows x 2 columns]
# fraction of records present & clicks as a fraction of its OWN records present
f=df1.pivot_table(index='name', aggfunc=[len, np.sum])  # `rows=` was renamed to `index=` in later pandas versions
f['len']['click']/sum(f['len']['click']) , f['sum']['click']/sum(f['sum']['click'])
A 0.500000
B 0.333333
C 0.166667
Name: click, dtype: float64, name
A 0.50
B 0.25
C 0.25
Name: click, dtype: float64)
But to be able to plot them, I need to store the top n records in an object that is supported by matplotlib.
I tried storing the
"top names" A,B, C ..etc by creating dict (output of
) )- and sorted by values - after which I stored the "click %" [A -> 0.50, B -> 0.25 , C-> 0.25] also in the same dictionary.
Since this is clearly overkill, I'm wondering if there's a more Pythonic way to do this.
I also tried head with groupby clause, but it doesn't give me what I am looking for. I am looking for a dataframe as above
A 0.500000
B 0.333333
C 0.166667
Name: click, dtype: float64, name
A 0.50
B 0.25
C 0.25
except that the top n logic should be embedded (head(n) does not work, since n depends on my data set; I guess I need to use "apply"?). After this, the object, which is a "" object, needs to be recognized by matplotlib with its own labels (the top n "name" values here).
Here's my dict function implementation :- # This is an OVERKILL just to fetch top n by a custom criteria as above
def freq_counts(df_var,n): # df_var is like , just to make the top n logic generic for each column name
for key,value in perct_freq.items():
if value>=n :
return vec
freq_counts(,3) # eg. top 3 freq counts - to get the names, see vec[i][0] which has the corresponding keys
#In this example when I calculate the "perct_freq", which is a Series object, I would ideally want to avoid converting this to a dict - What an overkill !
(1) Store the actual occurrences (len of names), and find the fraction of each "name" in the population.
(2) Against this, also find the "success outcome" and express it as a fraction of its OWN population.
(3) Finally, plot the top n name(s) with the output of (1) & (2) in the same plot. The criterion for top n should be based on (1) as a percentage.
I.e. for (1) & (2), use dataframes that support plotting with:
name as labels on the x axis
(1) as the y axis (primary)
(2) as the y axis (secondary)
PPS: In the code above -
(1) is > f['len']['click']/sum(f['len']['click']) and
(2) is > f['sum']['click']/sum(f['sum']['click'])
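One way to avoid the dict detour, sketched under assumptions: compute (1) and (2) with a groupby, keep both as columns of one DataFrame, and pick the top n rows with nlargest. The function name top_n_shares and the column names record_share and click_share are invented for illustration; the data is the df1 from the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    "name": ["A", "A", "B", "B", "C", "A"],
    "click": [1, 1, 0, 1, 1, 0],
})

def top_n_shares(df, n):
    """Return the n names with the largest share of records, with both shares."""
    # count = number of records per name, sum = number of clicks per name
    grouped = df.groupby("name")["click"].agg(["count", "sum"])
    shares = pd.DataFrame({
        "record_share": grouped["count"] / grouped["count"].sum(),  # (1)
        "click_share": grouped["sum"] / grouped["sum"].sum(),       # (2)
    })
    # The top n criterion is based on (1), as requested in the question.
    return shares.nlargest(n, "record_share")

top = top_n_shares(df1, 2)
print(top)
```

Because the result is a regular DataFrame indexed by name, it can be handed to pandas' matplotlib-backed plotting, for example top.plot(kind="bar", secondary_y="click_share"), which should put the names on the x axis with (1) on the primary and (2) on the secondary y axis.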