DataFrame from a 3D list

This is what I have (a 3D list):
[ STOCK NAME
last_price price2 price3
0 0.00 0.0 0.0
1 870.95 7650.0 2371500.0
2 870.95 7650.0 2371500.0
3 870.95 7650.0 2371500.0
4 877.30 7650.0 2371500.0
5 879.20 6800.0 2381700.0]
I want to create a DataFrame that looks exactly like the list above. How do I do so? I tried pd.DataFrame(the_list), but it gave me this error: ValueError: Must pass 2-d input. shape=(190, 6, 3). Thank you very much.
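The shape in the error, (190, 6, 3), suggests 190 blocks of 6 rows by 3 columns, while a DataFrame is two-dimensional. One possible approach, sketched here under assumptions, is to build one small DataFrame per block and join them side by side with pd.concat, so the block name becomes the outer level of the column index. The names the_list, stock_names and the three column labels below are hypothetical stand-ins, not data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the_list: 2 stocks x 6 rows x 3 price columns.
the_list = np.arange(36, dtype=float).reshape(2, 6, 3).tolist()
stock_names = ["STOCK_A", "STOCK_B"]            # assumed labels, one per block
columns = ["last_price", "price2", "price3"]    # assumed column names

# One 2D frame per stock, keyed by the stock name.
frames = {
    name: pd.DataFrame(block, columns=columns)
    for name, block in zip(stock_names, the_list)
}

# Side-by-side concat: the dict keys become the outer column level.
wide = pd.concat(frames, axis=1)
print(wide["STOCK_A"])
```

Selecting wide["STOCK_A"] then returns an ordinary 6 by 3 frame for that stock.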

Related

Getting index values from pd mean() and std() functions

I'm trying to get the index values from a pd std().
My final objective is to match the index with another df and insert the corresponding values (standard deviations).
(in): df_std['index'] = df_std.index
(out): Index([u'AAPL US Equity', u'QQQ US Equity', u'BRABCBACNPR4 BZ Equity'...dtype='object')
However, I've been unable to add the indexes to the "right" of df_std because of the types: df_std.index is a series while df_std is a df. When I try to do it, a line is added instead of a column:
(in): df_std['index'] = df_std.index
(out):
BRSTNCLF1R25 Govt 64.0864
BRITUBACNPR1 BZ Equity 2.67762
BRSTNCNTB4O9 Govt 48.2419
BRSTNCLF1R74 Govt 64.901
PBR US Equity 0.770755
BRBBASACNOR3 BZ Equity 2.93335
BRSTNCLF1R82 Govt 65.0979
index Index([u'AAPL US Equity', u'QQQ US Equity', u'...
dtype: object
I've already tried converting df_std.index to a tuple and to a DataFrame.
Thanks!
Edit:
I'm trying to match df_std['index'] with df_final['bloomberg_ticker'] and bring the std values to df_final['std']:
(in): print df_final
(out):
serie tipo tp_cnpjfundo valor id bloomberg_ticker \
0 NaN caixa NaN NaN 0 NaN
1 NaN titpublicos NaN NaN 1 BRSTNCLF1R17 Govt
2 NaN titpublicos NaN NaN 2 BRSTNCLF1R17 Govt
3 NaN titpublicos NaN NaN 3 BRSTNCLF1R25 Govt
(the column 'id' will be deleted later)
Use .reset_index() rather than assigning, if what you have is a DataFrame, i.e.
df_std = df_std.reset_index()
Example :
df = pd.DataFrame([0,1,2,3], index=['a','b','c','d'])
df = df.reset_index()
Output :
index 0
0 a 0
1 b 1
2 c 2
3 d 3
In case what you have is a Series, convert it to a DataFrame and then call reset_index, i.e. if df_std is the Series you have, then
df_std = df_std.to_frame().reset_index()
I think what you are trying to do is map the values of a Series onto a specific column, so you can use
df = pd.DataFrame({'col':['a','b','c','d','e'],'vales':[5,1,2,4,5]})
s = pd.Series([1,2,3],index=['a','b','c'])
df['new'] = df['col'].map(s)
Output :
col vales new
0 a 5 1.0
1 b 1 2.0
2 c 2 3.0
3 d 4 NaN
4 e 5 NaN
In your case, with df_std as the Series of standard deviations, you can use df_final['std'] = df_final['bloomberg_ticker'].map(df_std)
To check whether the values of the DataFrame column are present in the index of the Series, you can use .isin, i.e.
df['col'].isin(s.index) # Returns the boolean mask
df[df['col'].isin(s.index)] # Returns the rows whose 'col' value is in the Series index
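Put together for the question's column names, the mapping could look like the sketch below. The price data and the tickers AAA, BBB and ZZZ are made up for illustration; only bloomberg_ticker, df_std and df_final come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical price history: one column per ticker.
prices = pd.DataFrame({"AAA": [1.0, 2.0, 3.0], "BBB": [2.0, 4.0, 6.0]})
df_std = prices.std()  # Series indexed by ticker

# Hypothetical df_final with a ticker column, including a missing
# value and a ticker that has no standard deviation.
df_final = pd.DataFrame({"bloomberg_ticker": [np.nan, "AAA", "AAA", "BBB", "ZZZ"]})

# Look up each ticker in the Series index; unmatched rows become NaN.
df_final["std"] = df_final["bloomberg_ticker"].map(df_std)
print(df_final)
```

Rows whose ticker is missing or absent from df_std end up with NaN in the std column, so no separate isin check is needed before mapping.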

Slicing a pandas column based on the position of a matching substring

I am trying to slice a pandas column called PATH from a DataFrame called dframe such that I would get the ad1 container's filename with the extension in a new column called AD1position.
PATH
0 \
1 \abc.ad1\xaxaxa
2 \defghij.ad1\wbcbcb
3 \tuvwxyz.ad1\ydeded
In other words, here's what I want to see:
PATH AD1position
0 \
1 \abc.ad1\xaxaxa abc.ad1
2 \defghij.ad1\wbcbcb defghij.ad1
3 \tuvwxyz.ad1\ydeded tuvwxyz.ad1
If I were to do this in Excel, I would write:
=if(iserror(search(".ad1",[PATH])),"",mid([PATH],2,search(".ad1",[PATH]) + 3))
In Python, I seem to be stuck. Here's what I wrote thus far:
dframe['AD1position'] = dframe['PATH'].apply(lambda x: x['PATH'].str[1:(x['PATH'].str.find('.ad1')) \
+ 3] if x['PATH'].str.find('.ad1') != -1 else "")
Doing this returns the following error:
TypeError: string indices must be integers
I suspect that the problem is caused by the function in the slicer, but I'd appreciate any help with figuring out how to resolve this.
Use the .str.extract() function:
In [17]: df['AD1position'] = df.PATH.str.extract(r'.*?([^\\]*\.ad1)', expand=True)
In [18]: df
Out[18]:
PATH AD1position
0 \ NaN
1 \aaa\bbb NaN
2 \byz.ad1 byz.ad1
3 \abc.ad1\xaxaxa abc.ad1
4 \defghij.ad1\wbcbcb defghij.ad1
5 \tuvwxyz.ad1\ydeded tuvwxyz.ad1
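This will get you the first path component, i.e. the element at position 1 of the split (position 0 is the empty string before the leading backslash). Unlike the regex approach, it does not check that the component ends in .ad1.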
df['AD1position'] = df.PATH.str.split('\\').str.get(1)
Thank you Root.
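For the empty-string behaviour of the Excel formula in the question, a small variation on the extract approach may be enough. This is a sketch on made-up paths, using expand=False so the result is a Series, and fillna("") to turn non-matches into empty strings.

```python
import pandas as pd

# Hypothetical sample paths (doubled backslashes are Python escapes).
df = pd.DataFrame({"PATH": ["\\", "\\aaa\\bbb", "\\abc.ad1\\xaxaxa"]})

# Capture the run of non-backslash characters that ends in ".ad1";
# rows without such a component give NaN, which is replaced by "".
df["AD1position"] = (
    df["PATH"].str.extract(r"([^\\]*\.ad1)", expand=False).fillna("")
)
print(df)
```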

Not calculating sum for all columns in pandas dataframe

I'm pulling data from Impala using impyla and converting it to a DataFrame using as_pandas. I'm using pandas 0.18.0 and Python 2.7.9.
I'm trying to calculate the sum of every column in a DataFrame and select the columns whose sum is greater than a threshold.
self.data = self.data.loc[:,self.data.sum(axis=0) > 15]
But when I run this I get the error below:
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
Then I tried the following.
print 'length : ',len(self.data.sum(axis = 0)),' all columns : ',len(self.data.columns)
Then I get different lengths, i.e.
length : 78 all columns : 83
And I get the warning below:
C:\Python27\lib\decimal.py:1150: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
To achieve my goal, I tried another way:
for column in self.data.columns:
    sum = self.data[column].sum()
    if( sum < 15 ):
        self.data = self.data.drop(column,1)
Now I get other errors, like below:
TypeError: unsupported operand type(s) for +: 'Decimal' and 'float'
C:\Python27\lib\decimal.py:1150: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
Then I tried to get the data type of each column, like below.
print 'dtypes : ', self.data.dtypes
In the result, every column is one of int64, object, or float64.
Then I thought of converting the object columns, like below:
self.data.convert_objects(convert_numeric=True)
I'm still getting the same errors. Please help me solve this.
Note: none of the columns contain strings (i.e. characters), missing values, or empty values. I checked this using self.data.to_csv.
As I'm new to pandas and Python, please don't mind if this is a silly question. I just want to learn.
Please review the simple code below; it should help you understand the reason for the error.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random([3,3]))
df.iloc[0,0] = np.nan
print df
print df.sum(axis=0) > 1.5
print df.loc[:, df.sum(axis=0) > 1.5]
df.iloc[0,0] = 'string'
print df
print df.sum(axis=0) > 1.5
print df.loc[:, df.sum(axis=0) > 1.5]
0 1 2
0 NaN 0.336250 0.801349
1 0.930947 0.803907 0.139484
2 0.826946 0.229269 0.367627
0 True
1 False
2 False
dtype: bool
0
0 NaN
1 0.930947
2 0.826946
0 1 2
0 string 0.336250 0.801349
1 0.930947 0.803907 0.139484
2 0.826946 0.229269 0.367627
1 False
2 False
dtype: bool
Traceback (most recent call last):
...
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
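In short: as the output shows, sum() skips the columns it cannot add up (the object columns), so the boolean mask covers fewer columns than the DataFrame has and cannot be aligned. Your data needs additional preprocessing. Start by finding the object columns:
df.select_dtypes(include=['object'])
If they hold numbers stored as strings (or other convertible objects), you can convert them with df.astype(); otherwise you should purge them.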
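The 'Decimal' in the question's TypeError suggests that the object columns hold decimal.Decimal values rather than strings. Assuming that is the case, a minimal sketch of the preprocessing could look like this; the frame and its column names are hypothetical.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical frame: column "a" holds Decimal objects (object dtype).
data = pd.DataFrame({
    "a": [Decimal("10.5"), Decimal("7.25")],
    "b": [1, 2],
    "c": [20.0, 3.0],
})

# Find the object columns and cast each of them to float.
obj_cols = data.select_dtypes(include=["object"]).columns
data = data.astype({col: float for col in obj_cols})

# Every column is numeric now, so the mask has one entry per column.
selected = data.loc[:, data.sum(axis=0) > 15]
print(selected)
```

Casting to float gives up exact decimal arithmetic, which is usually acceptable for a threshold check like this one.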

Convert 2D numpy.ndarray to pandas.DataFrame

I have a pretty big numpy.ndarray. It's basically an array of arrays. I want to convert it to a pandas.DataFrame. What I want to do is shown in the code below:
from pandas import DataFrame
cache1 = DataFrame([{'id1': 'ABC1234'}, {'id1': 'NCMN7838'}])
cache2 = DataFrame([{'id2': 3276827}, {'id2': 98567498}, {'id2': 38472837}])
ndarr = [[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]]
arr = []
for idx, i in enumerate(ndarr):
    id1 = cache1.loc[idx].id1
    for idx2, val in enumerate(i):
        id2 = cache2.loc[idx2].id2
        if val > 0:
            arr.append(dict(id1=id1, id2=id2, value=val))
df = DataFrame(arr)
print(df.head())
I am mapping the index of the outer array and the inner array to index of two DataFrames to get certain IDs.
cache1 and cache2 are pandas.DataFrame. Each has ~100k rows.
This takes really really long, like a few hours to complete.
Is there some way I can speed it up?
I suspect your ndarr, if expressed as a 2D np.array, always has the shape (n, m), where n is the length of cache1.id1 and m is the length of cache2.id2. Also, the last entry in cache2 should be {'id2': 38472837} instead of {'id': 38472837}. If so, the following simple solution may be all that is needed:
In [30]:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.array(ndarr).ravel(),
                index=pd.MultiIndex.from_product([cache1.id1.values, cache2.id2.values],names=['idx1', 'idx2']),
                columns=['val'])
In [33]:
print(df.reset_index())
idx1 idx2 val
0 ABC1234 3276827 4.3
1 ABC1234 98567498 5.6
2 ABC1234 38472837 6.7
3 NCMN7838 3276827 3.2
4 NCMN7838 98567498 4.5
5 NCMN7838 38472837 2.1
[6 rows x 3 columns]
Actually, I think keeping the MultiIndex may be a better idea.
Something like this should work:
ndarr = np.asarray(ndarr) # if ndarr is actually an array, skip this
fast_df = pd.DataFrame({"value": ndarr.ravel()})
i1, i2 = [i.ravel() for i in np.indices(ndarr.shape)]
fast_df["id1"] = cache1["id1"].loc[i1].values
fast_df["id2"] = cache2["id2"].loc[i2].values
which gives
>>> fast_df
value id1 id2
0 4.3 ABC1234 3276827
1 5.6 ABC1234 98567498
2 6.7 ABC1234 38472837
3 3.2 NCMN7838 3276827
4 4.5 NCMN7838 98567498
5 2.1 NCMN7838 38472837
And then if you really want to drop the zero values, you can keep only the nonzero ones using fast_df = fast_df[fast_df['value'] != 0] (use > 0 instead to match the val > 0 check in the question exactly).
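Another way to avoid the Python-level loops, sketched here as an alternative rather than as either answerer's method, is to repeat and tile the two ID columns so they line up with the flattened array. It assumes, as above, that ndarr has one row per entry of cache1 and one column per entry of cache2. The zero placed in ndarr is an addition for illustration so that the filter has something to drop.

```python
import numpy as np
import pandas as pd

cache1 = pd.DataFrame({"id1": ["ABC1234", "NCMN7838"]})
cache2 = pd.DataFrame({"id2": [3276827, 98567498, 38472837]})
# One value set to 0.0 (an assumption) to demonstrate the val > 0 filter.
ndarr = np.array([[4.3, 5.6, 0.0], [3.2, 4.5, 2.1]])

n_rows, n_cols = ndarr.shape
long_df = pd.DataFrame({
    # Each id1 repeats once per column of its row.
    "id1": np.repeat(cache1["id1"].to_numpy(), n_cols),
    # The full id2 sequence repeats once per row.
    "id2": np.tile(cache2["id2"].to_numpy(), n_rows),
    # Row-major flattening matches the repeat/tile order above.
    "value": ndarr.ravel(),
})

# Same condition as the loop in the question.
long_df = long_df[long_df["value"] > 0].reset_index(drop=True)
print(long_df)
```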

Fetching top n records in pandas pivot , based on multiple criteria and plotting them with matplotlib

Use case: extending the pivot functionality of pandas. Fetch the top n records and plot their own "Click %" against the number of records with that name.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'name':['A', 'A', 'B', 'B','C','A'], 'click':[1,1,0,1,1,0]})
click name
0 1 A
1 1 A
2 0 B
3 1 B
4 1 C
5 0 A
[6 rows x 2 columns]
# fraction of records present & clicks as a fraction of its OWN records present
f=df1.pivot_table(index='name', aggfunc=[len, np.sum])
f['len']['click']/sum(f['len']['click']) , f['sum']['click']/sum(f['sum']['click'])
(name
A 0.500000
B 0.333333
C 0.166667
Name: click, dtype: float64, name
A 0.50
B 0.25
C 0.25
Name: click, dtype: float64)
But to be able to plot them, I need to store the top n records in an object that matplotlib supports.
I tried storing the "top names" (A, B, C, etc.) by creating a dict from the output of f['len']['click']/sum(f['len']['click']), sorted by value, after which I stored the "click %" [A -> 0.50, B -> 0.25, C -> 0.25] in the same dictionary.
Since this is clearly overkill, I am wondering if there is a more Pythonic way to do this.
I also tried head with a groupby clause, but it doesn't give me what I am looking for. I am looking for a DataFrame as above:
A 0.500000
B 0.333333
C 0.166667
Name: click, dtype: float64, name
A 0.50
B 0.25
C 0.25
except that the top n logic should be embedded (head(n) does not work because n depends on my dataset; I guess I need to use "apply"?). After this, the object, which is a "" object, needs to be recognized by matplotlib with its own labels (the top n "name" values here).
Here's my dict function implementation (this is OVERKILL just to fetch the top n by a custom criterion as above):
def freq_counts(df_var, n):  # df_var is like df1.name, just to make the top n logic generic for each column name
    perct_freq = dict((df_var.value_counts() * 100) / len(df_var))
    vec = []
    for key, value in perct_freq.items():
        if value >= n:
            vec.append([key, value])
    return vec
freq_counts(df1.name, 3)  # e.g. top 3 freq counts - to get the names, see vec[i][0] which has the corresponding keys
# In this example, when I calculate "perct_freq", which is a Series object, I would ideally want to avoid converting it to a dict - what overkill!
(1) Store the actual occurrences (len of names), and find the fraction of each "name" in the population.
(2) Against this, also find the "success outcome" and express it as a fraction of its OWN population.
(3) Finally, plot the top n name(s) with the output of (1) & (2) in the same plot; the criterion for top n should be based on (1) as a percentage.
I.e. for (1) & (2), use DataFrames that support plotting with:
name as labels on the x axis
(1) as the y axis (primary)
(2) as the y axis (secondary)
PPS: In the code above:
(1) is f['len']['click']/sum(f['len']['click']) and
(2) is f['sum']['click']/sum(f['sum']['click'])
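One way to get both fractions into a single DataFrame without a dict, sketched here as a suggestion, is groupby with agg followed by nlargest. The function names top_n_summary and plot_top and the column names share and click_share are hypothetical; share corresponds to (1) and click_share to (2). Here n is a number of names to keep, not a percentage threshold as in freq_counts.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["A", "A", "B", "B", "C", "A"],
                    "click": [1, 1, 0, 1, 1, 0]})

def top_n_summary(df, n):
    """Return the n most frequent names with their record and click shares."""
    stats = df.groupby("name")["click"].agg(["count", "sum"])
    summary = pd.DataFrame({
        "share": stats["count"] / stats["count"].sum(),      # (1)
        "click_share": stats["sum"] / stats["sum"].sum(),    # (2)
    })
    return summary.nlargest(n, "share")

def plot_top(top):
    """Bars for (1) on the primary y axis, a line for (2) on a secondary one."""
    import matplotlib.pyplot as plt  # imported here so the summary works without it
    names = list(top.index)
    fig, ax = plt.subplots()
    ax.bar(names, top["share"])
    ax.set_ylabel("share of records")
    ax2 = ax.twinx()
    ax2.plot(names, top["click_share"], "o-", color="black")
    ax2.set_ylabel("share of clicks")
    return fig

top = top_n_summary(df1, 2)
print(top)
```

Calling plot_top(top) and then plt.show() would draw the chart. Because top is an ordinary DataFrame indexed by name, the labels travel with the data.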