Create a dataframe from a list

I've got a learner that returns a list of values corresponding to dates.
I need the function to return a dataframe for plotting purposes. I've got the dataframe created, but now I need to populate the dataframe with the values from the list. Here is my code:
learner.addEvidence(x,y_values.values)
y_prediction_list = learner.query(x) # this yields a plain old python list
y_prediction_df = pd.DataFrame(index=dates,columns="Y-Prediction")
y_prediction_df = ??
return y_prediction_df

You can simply create the DataFrame with:
y_prediction_df = pd.DataFrame({"Y-Prediction": y_prediction_list}, index=dates)

I think you can use the `data` parameter of `DataFrame`:
import pandas as pd

# test data
dates = ["2014-05-22 05:37:59", "2015-05-22 05:37:59", "2016-05-22 05:37:59"]
y_prediction_list = [1, 2, 3]

y_prediction_df = pd.DataFrame(data=y_prediction_list, index=dates, columns=["Y-Prediction"])
print(y_prediction_df)
#                     Y-Prediction
#2014-05-22 05:37:59             1
#2015-05-22 05:37:59             2
#2016-05-22 05:37:59             3
print(y_prediction_df.info())
#<class 'pandas.core.frame.DataFrame'>
#Index: 3 entries, 2014-05-22 05:37:59 to 2016-05-22 05:37:59
#Data columns (total 1 columns):
#Y-Prediction    3 non-null int64
#dtypes: int64(1)
#memory usage: 48.0+ bytes
#None
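Since the frame is meant for plotting, it may also help to parse the string dates into a real `DatetimeIndex` with `pd.to_datetime`, so plotting libraries treat the index as a time axis rather than plain labels. A minimal sketch of that idea:

```python
import pandas as pd

dates = ["2014-05-22 05:37:59", "2015-05-22 05:37:59", "2016-05-22 05:37:59"]
y_prediction_list = [1, 2, 3]

# pd.to_datetime turns the strings into timestamps, giving a DatetimeIndex
y_prediction_df = pd.DataFrame(
    {"Y-Prediction": y_prediction_list},
    index=pd.to_datetime(dates),
)
```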

find unique values row wise on comma separated values

For a dataframe like below:
df = pd.DataFrame({'col':['abc,def,ghi,jkl,abc','abc,def,ghi,def,ghi']})
How can I get the unique values of column col row-wise, in a new column, as follows:
col unique_col
0 abc,def,ghi,jkl,abc abc,def,ghi,jkl
1 abc,def,ghi,def,ghi abc,def,ghi
I tried using iteritems but got an AttributeError:
for i, item in df.col.iteritems():
    print(item.unique())
import pandas as pd

df = pd.DataFrame({'col': ['abc,def,ghi,jkl,abc', 'abc,def,ghi,def,ghi']})

def unique_col(col):
    return ','.join(set(col.split(',')))

df['unique_col'] = df.col.apply(unique_col)
result:
col unique_col
0 abc,def,ghi,jkl,abc ghi,jkl,abc,def
1 abc,def,ghi,def,ghi ghi,abc,def
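Note that `set` does not preserve the original token order, which is why the result above comes out shuffled. If the order in the question's expected output matters, `dict.fromkeys` deduplicates while keeping first-seen order (Python 3.7+); a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'col': ['abc,def,ghi,jkl,abc', 'abc,def,ghi,def,ghi']})

def unique_col_ordered(col):
    # dict.fromkeys removes duplicates but keeps first-seen order
    return ','.join(dict.fromkeys(col.split(',')))

df['unique_col'] = df.col.apply(unique_col_ordered)
```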

Looping over a list of pandas data frames and create a new data frame

I have a list of data frames:
data_frames = [sort,sort1,sort2]
I'd like to iterate over them and store some stats in a new DataFrame. I feel like this is something trivial, but the function below returns an empty DataFrame for df_concat = df_stats(data_frames). What am I missing? I will appreciate your help.
Create an example data set:
import pandas as pd
data = {'number': [23,56,89], 'PVs': [23456, 34456, 6789]}
sort = pd.DataFrame.from_dict(data)
data1 = {'number': [28,52,12], 'PVs': [3423456, 2334456, 36789]}
sort1 = pd.DataFrame.from_dict(data1)
data2 = {'number': [123,5,86], 'PVs': [2345655, 934456, 16789]}
sort2 = pd.DataFrame.from_dict(data2)
The function to iterate over data frames:
def df_stats(data_frames):
    df = pd.DataFrame()
    for data in data_frames:
        df['Number'] = data.number.count()
        df["Total PVs"] = '{0:,.0f}'.format(data.PVs.sum())
        df["Average"] = '{0:,.0f}'.format(data.PVs.mean())
        df["Median"] = '{0:,.0f}'.format(data.PVs.median())
    return df
We can use pd.concat + groupby rather than a for loop:
pd.concat(data_frames, keys=[1,2,3]).groupby(level=0).agg({'number': 'count', 'PVs': ['sum', 'mean', 'median']})
Out[1117]:
number PVs
count sum mean median
1 3 64701 2.156700e+04 23456
2 3 5794701 1.931567e+06 2334456
3 3 3296900 1.098967e+06 934456
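The agg above returns MultiIndex columns; if flat column names are easier to work with downstream, they can be joined after the fact. A sketch using the same example data:

```python
import pandas as pd

sort = pd.DataFrame({'number': [23, 56, 89], 'PVs': [23456, 34456, 6789]})
sort1 = pd.DataFrame({'number': [28, 52, 12], 'PVs': [3423456, 2334456, 36789]})
sort2 = pd.DataFrame({'number': [123, 5, 86], 'PVs': [2345655, 934456, 16789]})

stats = (pd.concat([sort, sort1, sort2], keys=[1, 2, 3])
           .groupby(level=0)
           .agg({'number': 'count', 'PVs': ['sum', 'mean', 'median']}))

# Flatten ('PVs', 'sum') -> 'PVs_sum' etc. for easier access
stats.columns = ['_'.join(col) for col in stats.columns]
```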
Also, if you want to use your own function, you can fix it to:
df = pd.DataFrame()
for i, data in enumerate(data_frames):
    df.at[i, 'Number'] = data.number.count()
    df.at[i, "Total PVs"] = '{0:,.0f}'.format(data.PVs.sum())
    df.at[i, "Average"] = '{0:,.0f}'.format(data.PVs.mean())
    df.at[i, "Median"] = '{0:,.0f}'.format(data.PVs.median())
df
Out[1121]:
Number Total PVs Average Median
0 3.0 64,701 21,567 23,456
1 3.0 5,794,701 1,931,567 2,334,456
2 3.0 3,296,900 1,098,967 934,456
Try this:
''' Example DataFrames '''
data1 = pd.DataFrame({'number': [23,56,89], 'PVs': [23456, 34456, 6789]},
                     columns=['number', 'PVs'])
data2 = pd.DataFrame({'number': [28,52,12], 'PVs': [3423456, 2334456, 36789]},
                     columns=['number', 'PVs'])
data3 = pd.DataFrame({'number': [123,5,86], 'PVs': [2345655, 934456, 16789]},
                     columns=['number', 'PVs'])

''' The function returning the stats '''
def df_stats(dataFrame):
    df = pd.DataFrame({}, columns=['Number', 'Total PVs', 'Average', 'Median'])
    df.loc['Number'] = dataFrame['number'].count()
    df["Total PVs"] = '{0:,.0f}'.format(dataFrame['PVs'].sum())
    df["Average"] = '{0:,.0f}'.format(dataFrame['PVs'].mean())
    df["Median"] = '{0:,.0f}'.format(dataFrame['PVs'].median())
    return df

''' Create a list of DataFrames to iterate over '''
data_frames = [data1, data2, data3]

''' Create an empty DataFrame so you can include it in pd.concat() '''
result = pd.DataFrame()

''' Iterate over the DataFrame list and concatenate '''
for dataFrame in data_frames:
    tempDF = df_stats(dataFrame)
    result = pd.concat([result, tempDF], ignore_index=True)
result.head(3)
The output is:
Number Total PVs Average Median
0 3 64,701 21,567 23,456
1 3 5,794,701 1,931,567 2,334,456
2 3 3,296,900 1,098,967 934,456
The function below works:
dict_df = {'df1': sort1, 'df': sort, 'df2': sort2}

def df_stats(dict_df):
    df = pd.DataFrame(columns=['Number', 'Total PVs', 'Average', 'Median'], index=dict_df.keys())
    for name, data in dict_df.items():
        df.loc[name, "Number"] = data.number.count()
        df.loc[name, "Total PVs"] = '{0:,.0f}'.format(data.PVs.sum())
        df.loc[name, "Average"] = '{0:,.0f}'.format(data.PVs.mean())
        df.loc[name, "Median"] = '{0:,.0f}'.format(data.PVs.median())
    return df
Output:
Number Total PVs Average Median
df2 3 3,296,900 1,098,967 934,456
df1 3 5,794,701 1,931,567 2,334,456
df 3 64,701 21,567 23,456

How to pass array(multiple column) in below code using pyspark

How can I pass an array (multiple columns) instead of a single column in pyspark using this command:
new_df = new_df.filter(new_df.color.isin(*filter_list) == False)
eg:-
I used this code for removing garbage values (#, $) from a single column:
filter_list = ['##', '$']
new_df = new_df.filter(new_df.color.isin(*filter_list) == False)
In this example 'color' is the column name.
But I want to remove garbage values (#, ##, $, $$$) occurring multiple times across multiple columns.
Sample Input:-
id name Salary
# Yogita 3000
2 Bhavana 5000
$$ ### 7000
%$4# Neha $$$$
Sample Output:-
id name salary
2 Bhavana 5000
Can anybody help me?
Thanks in advance,
Yogita
Here is an answer using a user-defined function:
from functools import reduce
from itertools import chain
from pyspark.sql import functions as f
from pyspark.sql.types import BooleanType

filter_list = ['#', '##', '$', '$$$']

def filterfn(*x):
    # For each cell, check that none of the garbage tokens appears in it
    booleans = list(chain(*[[fltr not in elt for fltr in filter_list] for elt in x]))
    return reduce(lambda a, b: a and b, booleans, True)

filter_udf = f.udf(filterfn, BooleanType())
new_df.filter(filter_udf(*[col for col in new_df.columns])).show(10)
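The row predicate itself is plain Python, so its logic can be sanity-checked without a Spark session. A small sketch using the sample rows from the question:

```python
from functools import reduce
from itertools import chain

filter_list = ['#', '##', '$', '$$$']

def filterfn(*x):
    # True only if no garbage token appears as a substring of any cell
    booleans = list(chain(*[[fltr not in elt for fltr in filter_list] for elt in x]))
    return reduce(lambda a, b: a and b, booleans, True)

rows = [('#', 'Yogita', '3000'),
        ('2', 'Bhavana', '5000'),
        ('$$', '###', '7000'),
        ('%$4#', 'Neha', '$$$$')]

# Only the row with no garbage tokens in any column survives
kept = [row for row in rows if filterfn(*row)]
```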

Python: create a pandas data frame from a list

I am using the following code to create a data frame from a list:
test_list = ['a','b','c','d']
df_test = pd.DataFrame.from_records(test_list, columns=['my_letters'])
df_test
The above code works fine. Then I tried the same approach for another list:
import pandas as pd
q_list = ['112354401', '116115526', '114909312', '122425491', '131957025', '111373473']
df1 = pd.DataFrame.from_records(q_list, columns=['q_data'])
df1
But it gave me the following errors this time:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-24-99e7b8e32a52> in <module>()
1 import pandas as pd
2 q_list = ['112354401', '116115526', '114909312', '122425491', '131957025', '111373473']
----> 3 df1 = pd.DataFrame.from_records(q_list, columns=['q_data'])
4 df1
/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
1021 else:
1022 arrays, arr_columns = _to_arrays(data, columns,
-> 1023 coerce_float=coerce_float)
1024
1025 arr_columns = _ensure_index(arr_columns)
/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py in _to_arrays(data, columns, coerce_float, dtype)
5550 data = lmap(tuple, data)
5551 return _list_to_arrays(data, columns, coerce_float=coerce_float,
-> 5552 dtype=dtype)
5553
5554
/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py in _list_to_arrays(data, columns, coerce_float, dtype)
5607 content = list(lib.to_object_array(data).T)
5608 return _convert_object_array(content, columns, dtype=dtype,
-> 5609 coerce_float=coerce_float)
5610
5611
/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py in _convert_object_array(content, columns, coerce_float, dtype)
5666 # caller's responsibility to check for this...
5667 raise AssertionError('%d columns passed, passed data had %s '
-> 5668 'columns' % (len(columns), len(content)))
5669
5670 # provide soft conversion of object dtypes
AssertionError: 1 columns passed, passed data had 9 columns
Why would the same approach work for one list but not another? Any idea what might be wrong here? Thanks a lot!
DataFrame.from_records treats a string as a sequence of characters, so it needs as many columns as the string has characters (hence the "9 columns" in the error).
You could simply use the DataFrame constructor instead:
In [3]: pd.DataFrame(q_list, columns=['q_data'])
Out[3]:
q_data
0 112354401
1 116115526
2 114909312
3 122425491
4 131957025
5 111373473
In[20]: test_list = [['a','b','c'], ['AA','BB','CC']]
In[21]: pd.DataFrame(test_list, columns=['col_A', 'col_B', 'col_C'])
Out[21]:
col_A col_B col_C
0 a b c
1 AA BB CC
In[22]: pd.DataFrame(test_list, index=['col_low', 'col_up']).T
Out[22]:
col_low col_up
0 a AA
1 b BB
2 c CC
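If you do want to stick with from_records, wrapping each string in a one-element tuple makes every record a single field, so the column count matches. A short sketch:

```python
import pandas as pd

q_list = ['112354401', '116115526', '114909312', '122425491', '131957025', '111373473']

# Each record is now a 1-tuple, so from_records sees exactly one column
df1 = pd.DataFrame.from_records([(q,) for q in q_list], columns=['q_data'])
```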
If you want to create a DataFrame from multiple lists, you can simply zip the lists. zip returns a 'zip' object, so convert it back to a list:
mydf = pd.DataFrame(list(zip(lstA, lstB)), columns = ['My List A', 'My List B'])
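A runnable version of the zip approach (lstA and lstB here are just example lists):

```python
import pandas as pd

lstA = [1, 2, 3]
lstB = ['x', 'y', 'z']

# zip pairs the lists element-wise; list() materialises the pairs for pandas
mydf = pd.DataFrame(list(zip(lstA, lstB)), columns=['My List A', 'My List B'])
```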
pd.concat only accepts a list of Series/DataFrames, not raw strings, so wrap the list first:
test_list = ['a', 'b', 'c', 'd']
pd.concat([pd.Series(test_list, name='my_letters')], axis=1)
You could also take the help of numpy.
import numpy as np
df1 = pd.DataFrame(np.array(q_list),columns=['q_data'])

Pandas append list to list of column names

I'm looking for a way to append a list of column names to existing column names in a DataFrame in pandas and then reorder them by col_start + col_add.
The DataFrame already contains the columns from col_start.
Something like:
import pandas as pd
df = pd.read_csv("file.csv")
col_start = ["col_a", "col_b", "col_c"]
col_add = ["Col_d", "Col_e", "Col_f"]
df = pd.concat([df,pd.DataFrame(columns = list(col_add))]) #Add columns
df = df[[col_start.extend(col_add)]] #Rearrange columns
Also, is there a way to capitalize the first letter for each item in col_start, analogous to title() or capitalize()?
Your code is nearly there, a couple things:
df = pd.concat([df,pd.DataFrame(columns = list(col_add))])
can be simplified to just this as col_add is already a list:
df = pd.concat([df,pd.DataFrame(columns = col_add)])
Also, you can just add the two lists together, so:
df = df[[col_start.extend(col_add)]]
becomes
df = df[col_start+col_add]
And to capitalise the first letter in your list just do:
In [184]:
col_start = ["col_a", "col_b", "col_c"]
col_start = [x.title() for x in col_start]
col_start
Out[184]:
['Col_A', 'Col_B', 'Col_C']
EDIT
To avoid the KeyError on the capitalised column names, you need to capitalise after calling concat, the columns have a vectorised str title method:
In [187]:
df = pd.DataFrame(columns = col_start + col_add)
df
Out[187]:
Empty DataFrame
Columns: [col_a, col_b, col_c, Col_d, Col_e, Col_f]
Index: []
In [188]:
df.columns = df.columns.str.title()
df.columns
Out[188]:
Index(['Col_A', 'Col_B', 'Col_C', 'Col_D', 'Col_E', 'Col_F'], dtype='object')
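Putting the two fixes together (concat first, then the vectorised .str.title()), a runnable sketch with a small in-memory frame standing in for the CSV:

```python
import pandas as pd

# Small frame standing in for pd.read_csv("file.csv")
df = pd.DataFrame({'col_a': [1], 'col_b': [2], 'col_c': [3]})
col_start = ["col_a", "col_b", "col_c"]
col_add = ["Col_d", "Col_e", "Col_f"]

df = pd.concat([df, pd.DataFrame(columns=col_add)])  # add the new columns
df = df[col_start + col_add]                         # reorder
df.columns = df.columns.str.title()                  # capitalise afterwards
```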
Here is what you want to do:
import pandas as pd

# Here you have a first dataframe
d1 = pd.DataFrame([[1,2,3],[4,5,6]], columns=['col1','col2','col3'])
# and a second one
d2 = pd.DataFrame([[8,7,3,8],[4,8,6,8]], columns=['col4','col5','col6','col7'])

# Here we can make a dataframe with d1 and d2
d = pd.concat((d1, d2), axis=1)

# Want a different order for the columns?
d = d[col_start + col_add]
If you want to capitalize the values in a column 'col', you can do:
d['col'] = d['col'].str.capitalize()
PS: update pandas if .str.capitalize() doesn't work.
Or, you can do:
df['col'] = df['col'].map(lambda x: x.capitalize())