Pandas append list to list of column names - python-2.7

I'm looking for a way to append a list of column names to the existing column names of a pandas DataFrame and then reorder the columns as col_start + col_add.
The DataFrame already contains the columns from col_start.
Something like:
import pandas as pd
df = pd.read_csv("file.csv")
col_start = ["col_a", "col_b", "col_c"]
col_add = ["Col_d", "Col_e", "Col_f"]
df = pd.concat([df,pd.DataFrame(columns = list(col_add))]) #Add columns
df = df[[col_start.extend(col_add)]] #Rearrange columns
Also, is there a way to capitalize the first letter for each item in col_start, analogous to title() or capitalize()?

Your code is nearly there; a couple of things:
df = pd.concat([df,pd.DataFrame(columns = list(col_add))])
can be simplified to just this, as col_add is already a list:
df = pd.concat([df,pd.DataFrame(columns = col_add)])
Also, you can just add the two lists together, so:
df = df[[col_start.extend(col_add)]]
becomes
df = df[col_start+col_add]
And to capitalise the first letter of each item in your list, just do:
In [184]:
col_start = ["col_a", "col_b", "col_c"]
col_start = [x.title() for x in col_start]
col_start
Out[184]:
['Col_A', 'Col_B', 'Col_C']
EDIT
To avoid the KeyError on the capitalised column names, capitalise after calling concat; the columns have a vectorised .str.title() method:
In [187]:
df = pd.DataFrame(columns = col_start + col_add)
df
Out[187]:
Empty DataFrame
Columns: [col_a, col_b, col_c, Col_d, Col_e, Col_f]
Index: []
In [188]:
df.columns = df.columns.str.title()
df.columns
Out[188]:
Index(['Col_A', 'Col_B', 'Col_C', 'Col_D', 'Col_E', 'Col_F'], dtype='object')
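Putting both pieces together, a minimal end-to-end sketch of the workflow from the question ("file.csv" stands in for the questioner's actual file):
import pandas as pd

col_start = ["col_a", "col_b", "col_c"]
col_add = ["Col_d", "Col_e", "Col_f"]

df = pd.read_csv("file.csv")                         # placeholder path from the question
df = pd.concat([df, pd.DataFrame(columns=col_add)])  # add the new (empty) columns
df = df[col_start + col_add]                         # reorder: col_start first, then col_add
df.columns = df.columns.str.title()                  # Col_A ... Col_F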

Here is what you want to do:
import pandas as pd
# A first dataframe
d1 = pd.DataFrame([[1,2,3],[4,5,6]], columns=['col1','col2','col3'])
# and a second one
d2 = pd.DataFrame([[8,7,3,8],[4,8,6,8]], columns=['col4','col5','col6', 'col7'])
# Build a single dataframe from d1 and d2
d = pd.concat((d1,d2), axis=1)
# Want the columns in a different order? Index with the desired list of names
# (col_start + col_add from the question)
d = d[col_start + col_add]
If you want to capitalize the values of a column 'col', you can do:
d['col'] = d['col'].str.capitalize()
PS: update pandas if .str.capitalize() doesn't work.
Alternatively, you can do:
df['col'] = df['col'].map(lambda x:x.capitalize())
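A side note (not part of either answer) that matters for the question's underscore-separated names: capitalize() only upper-cases the first character, while title() also upper-cases the letter after each underscore, which is why the first answer reaches for title():
col_start = ["col_a", "col_b", "col_c"]
print([c.capitalize() for c in col_start])  # ['Col_a', 'Col_b', 'Col_c']
print([c.title() for c in col_start])       # ['Col_A', 'Col_B', 'Col_C']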

Related

Find unique values row-wise on comma-separated values

For a dataframe like the one below:
df = pd.DataFrame({'col':['abc,def,ghi,jkl,abc','abc,def,ghi,def,ghi']})
how can I get the row-wise unique values of the column col into a new column, like this:
col unique_col
0 abc,def,ghi,jkl,abc abc,def,ghi,jkl
1 abc,def,ghi,def,ghi abc,def,ghi
I tried using iteritems but got an AttributeError:
for i, item in df.col.iteritems():
    print item.unique()
import pandas as pd

df = pd.DataFrame({'col':['abc,def,ghi,jkl,abc','abc,def,ghi,def,ghi']})

def unique_col(col):
    # split on commas and de-duplicate with a set (a set does not preserve order)
    return ','.join(set(col.split(',')))

df['unique_col'] = df.col.apply(unique_col)
result:
col unique_col
0 abc,def,ghi,jkl,abc ghi,jkl,abc,def
1 abc,def,ghi,def,ghi ghi,abc,def
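Note that set() does not preserve order, which is why the result above is ordered differently from the expected output. If first-seen order matters, a small sketch (not from the original answer) using OrderedDict, which keeps insertion order on Python 2.7:
from collections import OrderedDict
import pandas as pd

df = pd.DataFrame({'col':['abc,def,ghi,jkl,abc','abc,def,ghi,def,ghi']})

def unique_col_ordered(value):
    # OrderedDict.fromkeys de-duplicates while keeping first-seen order
    return ','.join(OrderedDict.fromkeys(value.split(',')))

df['unique_col'] = df.col.apply(unique_col_ordered)  # abc,def,ghi,jkl / abc,def,ghi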

Concatenate two data frames to a data frame of square matrix

I have two pandas dataframes whose shapes are n x n and m x n (m < n). For example:
df1 = pd.DataFrame([[0,1,0,1],[1,0,0,1],[0,0,0,1],[1,1,1,0]])
df2 = pd.DataFrame([[1,1,1,0],[1,1,0,1]])
I'd like to get the dataframe of a square matrix by concatenating above dataframes:
df3 = foo(df1, df2)
print df3.values
This should print the following matrix:
[[0,1,0,1,1,1],
[1,0,0,1,1,1],
[0,0,0,1,1,0],
[1,1,1,0,0,1],
[1,1,1,0,0,0],
[1,1,0,1,0,0]]
The logic of the concatenation is as follows:
the upper-left part of the square matrix comes from df1
the upper-right part of it comes from the transpose of df2
the bottom-left part of it comes from df2
all elements of the rest of it (the bottom-right part) are zero.
How do I implement the above logic (foo method)?
Here is a sample implementation of foo:
def foo(_df1, _df2):
    df1 = _df1.reset_index(drop=True)  # make sure the index is ordered
    df2 = _df2.reset_index(drop=True)  # make sure the index is ordered
    df2_transpose = df2.transpose().reset_index(drop=True)  # reset the index to match the join below
    df_upper = df1.join(df2_transpose, rsuffix="_")  # add suffix for the additional columns
    df_upper.columns = [i for i in range(df_upper.shape[1])]  # reset column names to int
    df = pd.concat([df_upper, df2])  # fill the bottom left
    df.fillna(0, inplace=True)  # fill the bottom right with 0
    return df
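A quick usage note on this version (my observation, not part of the answer): pd.concat leaves a repeated row index (0..3 followed by 0..1) and the zero-filled columns come back as floats, so the result can be tidied up as follows:
df3 = foo(df1, df2).reset_index(drop=True).astype(int)
print(df3.values)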
The foo function:
import numpy as np

def foo(df1_data, df2_data):
    df_test = pd.concat([df1_data, df2_data])
    # pad the transposed df2 with zeros so it has as many rows as df_test
    a = np.concatenate((df2_data.values.T,
                        np.zeros(shape=(df_test.values.shape[0] - df_test.values.shape[1],
                                        df2_data.values.shape[0]))))
    final_array = np.append(df_test.values, a, axis=1).astype(int)
    df3_data = pd.DataFrame(final_array)
    return df3_data
df3 = foo(df1,df2)
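As an alternative sketch (not from either answer, and assuming NumPy 1.13+ for np.block), the four quadrants described in the question can be assembled directly as a block matrix:
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[0,1,0,1],[1,0,0,1],[0,0,0,1],[1,1,1,0]])
df2 = pd.DataFrame([[1,1,1,0],[1,1,0,1]])

m = len(df2)  # number of extra rows/columns contributed by df2
square = np.block([
    [df1.values, df2.values.T],                 # upper-left, upper-right
    [df2.values, np.zeros((m, m), dtype=int)],  # bottom-left, bottom-right
])
df3 = pd.DataFrame(square)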

Average of median of a column in a list of dataframes

I am looking for the best way to take the average of the medians of a column (same column name) across a list of dataframes.
Let's say I have a list of dataframes, list_df. I can write the following for loop to get the required output, but I am more interested in whether the for loop can be eliminated:
med_arr = []
list_df = [df1, df2, df3]
for df in list_df:
    med_arr.append(np.median(df['col_name']))
np.mean(med_arr)
Consider the sample data
np.random.seed([3,1415])
df1 = pd.DataFrame(dict(col_name=np.random.randint(10, size=10)))
df2 = pd.DataFrame(dict(col_name=np.random.randint(10, size=10)))
df3 = pd.DataFrame(dict(col_name=np.random.randint(10, size=10)))
list_df = [df1, df2, df3]
Option 1
pandas
pd.concat([d['col_name'] for d in list_df], axis=1).median().mean()
3.8333333333333335
Option 2
numpy
np.median([d['col_name'].values for d in list_df], 1).mean()
3.8333333333333335
This could be done as a list comprehension:
list_df = [ df1, df2, df3 ]
med_arr = [ np.median( df['col_name'] ) for df in list_df ]
np.mean(med_arr)
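One difference between the options worth flagging (my note, using deliberately unequal lengths): Option 2 stacks the columns into a single 2-D array, so every dataframe must have the same number of rows; Option 1 and the list comprehension do not, because concat aligns on the index and median() skips NaN:
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
list_df = [pd.DataFrame(dict(col_name=np.random.randint(10, size=n)))
           for n in (10, 8, 12)]  # deliberately unequal lengths

pd.concat([d['col_name'] for d in list_df], axis=1).median().mean()  # still works
np.mean([np.median(d['col_name']) for d in list_df])                 # also works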

ValueError: Shape of passed values is (6, 251), indices imply (6, 1)

I am getting an error and I'm not sure how to fix it.
Here is my code:
from matplotlib.finance import quotes_historical_yahoo_ochl
from datetime import date
from datetime import datetime
import pandas as pd
today = date.today()
start = (today.year-1, today.month, today.day)
quotes = quotes_historical_yahoo_ochl('AXP', start, today)
fields = ['date', 'open', 'close', 'high', 'low', 'volume']
list1 = []
for i in range(len(quotes)):
    x = date.fromordinal(int(quotes[i][0]))
    y = datetime.strftime(x, '%Y-%m-%d')
list1.append(y)
quotesdf = pd.DataFrame(quotes, index = list1, columns = fields)
quotesdf = quotesdf.drop(['date'], axis = 1)
print quotesdf
How can I change my code to achieve my goal, i.e. change the date format and drop the original date column?
In principle your code should work; you just need to indent it correctly, that is, append the value of y to list1 inside the for loop:
for i in range(len(quotes)):
    x = date.fromordinal(int(quotes[i][0]))
    y = datetime.strftime(x, '%Y-%m-%d')
    list1.append(y)
That way list1 will have as many entries as quotes instead of only one (the last), and the DataFrame constructor will no longer complain about mismatched shapes.
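For what it's worth (not part of the original answer), the same date index can also be built without an explicit loop, assuming quotes as returned above, where the first field of each row is an ordinal date number:
list1 = [date.fromordinal(int(q[0])).strftime('%Y-%m-%d') for q in quotes]
quotesdf = pd.DataFrame(quotes, index=list1, columns=fields).drop(['date'], axis=1)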

Printing Results from Loops

I currently have a piece of code that works in two segments. The first segment opens an existing text file from a specific path on my local drive and arranges it, based on certain indices, into a list of sub-lists. In the second segment I take the sub-lists I have created and group them on a shared index to simplify them (this starts at def merge_subs). I get no error, but nothing is printed when I try to print the variable answer. Am I not looping over the original list of sub-lists correctly? Ultimately I would like a variable that holds the final product of these loops so that I can write its contents to a new text file. Here is the code I am working with:
from itertools import groupby, chain
from operator import itemgetter
with open ("somepathname") as g:
# reads text from lines and turns them into a list sub-lists
lines = g.readlines()
for line in lines:
matrix = line.split()
JD = matrix [2]
minTime= matrix [5]
maxTime= matrix [7]
newLists = [JD,minTime,maxTime]
L = newLists
def merge_subs(L):
dates = {}
for sub in L:
date = sub[0]
if date not in dates:
dates[date] = []
dates[date].extend(sub[1:])
answer = []
for date in sorted(dates):
answer.append([date] + dates[date])
New code:
def openfile(self):
    filename = askopenfilename(parent=root)
    self.lines = open(filename)

def simplify(self):
    g = self.lines.readlines()
    for line in g:
        matrix = line.split()
        JD = matrix[2]
        minTime = matrix[5]
        maxTime = matrix[7]
        self.newLists = [JD, minTime, maxTime]
        print(self.newLists)
    dates = {}
    for sub in self.newLists:
        date = sub[0]
        if date not in dates:
            dates[date] = []
        dates[date].extend(sub[1:])
    answer = []
    for date in sorted(dates):
        print(answer.append([date] + dates[date]))
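For reference, a minimal sketch (my own, based only on the question's code and its "somepathname" placeholder) of the two likely issues: the first segment keeps only the last line's sub-list, and merge_subs builds answer but never returns it and is never called. One possible arrangement:
def read_sublists(path):
    # collect one [JD, minTime, maxTime] sub-list per line, not just the last one
    sublists = []
    with open(path) as g:
        for line in g:
            matrix = line.split()
            sublists.append([matrix[2], matrix[5], matrix[7]])
    return sublists

def merge_subs(L):
    dates = {}
    for sub in L:
        dates.setdefault(sub[0], []).extend(sub[1:])
    # return the merged result so the caller can print it or write it to a file
    return [[date] + dates[date] for date in sorted(dates)]

answer = merge_subs(read_sublists("somepathname"))
print(answer)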