find unique values row wise on comma separated values - python-2.7

For a dataframe like the one below:
df = pd.DataFrame({'col':['abc,def,ghi,jkl,abc','abc,def,ghi,def,ghi']})
How do I get the unique values of the column col, row-wise, in a new column, as follows:
col unique_col
0 abc,def,ghi,jkl,abc abc,def,ghi,jkl
1 abc,def,ghi,def,ghi abc,def,ghi
I tried using iteritems but got an AttributeError:
for i, item in df.col.iteritems():
    print item.unique()

Each row value here is a plain string, which has no .unique() method, hence the AttributeError. Split the string and deduplicate instead:
import pandas as pd

df = pd.DataFrame({'col':['abc,def,ghi,jkl,abc','abc,def,ghi,def,ghi']})

def unique_col(col):
    return ','.join(set(col.split(',')))

df['unique_col'] = df.col.apply(unique_col)
result:
col unique_col
0 abc,def,ghi,jkl,abc ghi,jkl,abc,def
1 abc,def,ghi,def,ghi ghi,abc,def
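Note that set does not preserve the order in which tokens first appear (hence ghi,jkl,abc,def above rather than abc,def,ghi,jkl). If order matters, a minimal order-preserving variant of the same idea uses collections.OrderedDict:
from collections import OrderedDict

def unique_col_ordered(col):
    # OrderedDict keys keep first-seen order while dropping duplicates
    return ','.join(OrderedDict.fromkeys(col.split(',')))

df['unique_col'] = df.col.apply(unique_col_ordered)
# 0    abc,def,ghi,jkl
# 1    abc,def,ghi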

Related

Pyspark: how to loop through a dataframe column which contains list elements?

I have 2 dataframes (all_posts and headliners). How do I loop through the all_posts['tagged_persons'] column to see if an element of the list AND the corresponding year equal a row of the headliners dataframe?
You can explode the tagged_persons column first. After that, join it with headliners on the tagged_persons, Artist, and year columns, then filter the rows where the headliners Artist is not null to get the resultant data.
from pyspark.sql.functions import explode
# Explode the list column so each tagged person gets its own row;
# keep year, which the join condition needs, and alias the exploded column
all_posts = all_posts.select(all_posts.Artist, all_posts.year,
                             explode(all_posts.tagged_persons).alias('tagged_persons'))
cond = [all_posts.tagged_persons == headliners.Artist, all_posts.year == headliners.Year]
join_df = all_posts.join(headliners, cond, 'left')
# Keep only the rows that matched a headliner
filter_df = join_df.filter(headliners.Artist.isNotNull())
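Since the not-null filter keeps exactly the rows that matched, an inner join should be equivalent here and skips the extra filter step:
join_df = all_posts.join(headliners, cond, 'inner')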

Reading CSV file and take majority voting of certain column

I need to calculate the majority vote for a TARGET_LABEL column of my CSV file in Python.
I have a data frame with a Row ID and an assigned TARGET_LABEL. What I need is the count of the majority TARGET_LABEL. How do I do this?
For Example Data is in this form:
Row ID    TARGET_LABEL
Row2 0
Row6 0
Row7 0
Row10 0
Row12 0
Row15 1
. .
. .
Row99999 1
I have a Python script which only reads data from the CSV. Here it is:
import csv

ifile = open('file1.csv', "rb")
reader = csv.reader(ifile)
rownum = 0
for row in reader:
    # Save header row.
    if rownum == 0:
        header = row
    else:
        colnum = 0
        for col in row:
            print '%-8s: %s' % (header[colnum], col)
            colnum += 1
    rownum += 1
ifile.close()
In case TARGET_LABEL does not have NaN values, you could use:
counts = df['TARGET_LABEL'].value_counts()
max_counts = counts.max()
Otherwise, if it could contain NaN values,
df = df.dropna(subset=['TARGET_LABEL'])
removes the NaN rows first. Then
df['TARGET_LABEL'].value_counts().max()
gives you the max count, and
df['TARGET_LABEL'].value_counts().idxmax()
gives you the most frequent value.
The collections module contains the class Counter, which works like a dict (more precisely, a defaultdict(lambda: 0)) and can be used to find the most frequent item.
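A minimal sketch with Counter, assuming the labels have already been read into a plain Python list (the labels values below are made up):
from collections import Counter

labels = [0, 0, 0, 0, 0, 1]  # hypothetical TARGET_LABEL values
value, count = Counter(labels).most_common(1)[0]
print value, count  # prints: 0 5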

Create a dataframe from a list

I've got a learner that returns a list of values corresponding to dates.
I need the function to return a dataframe for plotting purposes. I've got the dataframe created, but now I need to populate the dataframe with the values from the list. Here is my code:
learner.addEvidence(x,y_values.values)
y_prediction_list = learner.query(x) # this yields a plain old python list
y_prediction_df = pd.DataFrame(index=dates,columns="Y-Prediction")
y_prediction_df = ??
return y_prediction_df
You can simply create the dataframe with:
y_prediction_df = pd.DataFrame({"Y-Prediction": y_prediction_list}, index=dates)
I think you can use the data parameter of DataFrame:
import pandas as pd
#test data
dates = ["2014-05-22 05:37:59", "2015-05-22 05:37:59","2016-05-22 05:37:59"]
y_prediction_list = [1,2,3]
y_prediction_df = pd.DataFrame(data=y_prediction_list, index=dates,columns=["Y-Prediction"])
print y_prediction_df
# Y-Prediction
#2014-05-22 05:37:59 1
#2015-05-22 05:37:59 2
#2016-05-22 05:37:59 3
print y_prediction_df.info()
#<class 'pandas.core.frame.DataFrame'>
#Index: 3 entries, 2014-05-22 05:37:59 to 2016-05-22 05:37:59
#Data columns (total 1 columns):
#Y-Prediction 3 non-null int64
#dtypes: int64(1)
#memory usage: 48.0+ bytes
#None
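If the index should hold real timestamps rather than strings, one small variation (same test data, assuming the date strings parse cleanly) is to convert first with pd.to_datetime:
y_prediction_df = pd.DataFrame(data=y_prediction_list,
                               index=pd.to_datetime(dates),
                               columns=["Y-Prediction"])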

Pandas append list to list of column names

I'm looking for a way to append a list of column names to the existing column names in a pandas DataFrame and then reorder the columns as col_start + col_add.
The DataFrame already contains the columns from col_start.
Something like:
import pandas as pd
df = pd.read_csv('file.csv')
col_start = ["col_a", "col_b", "col_c"]
col_add = ["Col_d", "Col_e", "Col_f"]
df = pd.concat([df,pd.DataFrame(columns = list(col_add))]) #Add columns
df = df[[col_start.extend(col_add)]] #Rearrange columns
Also, is there a way to capitalize the first letter for each item in col_start, analogous to title() or capitalize()?
Your code is nearly there; a couple of things:
df = pd.concat([df,pd.DataFrame(columns = list(col_add))])
can be simplified to just this as col_add is already a list:
df = pd.concat([df,pd.DataFrame(columns = col_add)])
Also, you can just add the two lists together, so:
df = df[[col_start.extend(col_add)]]
becomes
df = df[col_start+col_add]
And to capitalise the first letter in your list just do:
In [184]:
col_start = ["col_a", "col_b", "col_c"]
col_start = [x.title() for x in col_start]
col_start
Out[184]:
['Col_A', 'Col_B', 'Col_C']
EDIT
To avoid the KeyError on the capitalised column names, you need to capitalise after calling concat; the columns have a vectorised str.title method:
In [187]:
df = pd.DataFrame(columns = col_start + col_add)
df
Out[187]:
Empty DataFrame
Columns: [col_a, col_b, col_c, Col_d, Col_e, Col_f]
Index: []
In [188]:
df.columns = df.columns.str.title()
df.columns
Out[188]:
Index(['Col_A', 'Col_B', 'Col_C', 'Col_D', 'Col_E', 'Col_F'], dtype='object')
Here is what you want to do:
import pandas as pd

# Here you have a first dataframe
d1 = pd.DataFrame([[1,2,3],[4,5,6]], columns=['col1','col2','col3'])
# and a second one
d2 = pd.DataFrame([[8,7,3,8],[4,8,6,8]], columns=['col4','col5','col6','col7'])
# Here we can make a single dataframe out of d1 and d2
d = pd.concat((d1,d2), axis=1)
# Want a different column order? With these example columns:
col_start = ['col1','col2','col3']
col_add = ['col4','col5','col6','col7']
d = d[col_add + col_start]
If you want to capitalize values from a column 'col', you can do
d['col'] = d['col'].str.capitalize()
PS: Update Pandas if ".str.capitalize()" doesn't work.
Or, equivalently:
df['col'] = df['col'].map(lambda x:x.capitalize())

Append new column by subtracting existing column in csv by using python

I tried to append a new column to an existing CSV file using Python. It shows no error, but the column is not created.
I have a CSV file with 5 columns and I want to add data in the 6th column by subtracting between existing columns.
ID,SURFACES,A1X,A1Y,A1Z,A2X
1,GROUND,800085.3323,961271.977,-3.07E-18,800080.8795
Add the column AX (= A1X - A2X).
CODE:
x = csv.reader(open('E:/solarpotential analysis/iitborientation/trialcsv.csv','rb'))
y = csv.writer(open('E:/solarpotential analysis/iitborientation/trial.csv','wb',buffering=0))
for row in x:
    a = float(row[0])
    b = str(row[1])
    c = float(row[2])
    d = float(row[3])
    e = float(row[4])
    f = float(row[2] - row[5])
    y.writerow([a,b,c,d,e,f])
It shows no error but the output file is not updated.
You can do it this way with plain file I/O (note that in your code row[2] - row[5] tries to subtract strings, and the header row cannot be converted to float):
inputt = open("input.csv", "r")
outputt = open("output.csv", "w")

# copy the header and name the new column
header = inputt.readline().rstrip("\n")
outputt.write(header + ",AX\n")
for line in inputt.readlines():
    fields = line.rstrip("\n").split(",")
    ax = float(fields[2]) - float(fields[5])  # AX = A1X - A2X
    outputt.write(line.rstrip("\n") + "," + str(ax) + "\n")
inputt.close()
outputt.close()
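A pandas-based alternative (a minimal sketch; input.csv and output.csv stand in for the real file paths):
import pandas as pd

df = pd.read_csv('input.csv')
df['AX'] = df['A1X'] - df['A2X']  # new column from the difference
df.to_csv('output.csv', index=False)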