PySpark column is not iterable - list

I have a df similar to this:
old_df = sqlContext.createDataFrame(
    [('375', 20),
     ('265', 20),
     ('052', 20),
     ('111', None)],
    ['old_col', 'example_new_col_val'])
I need to create a new column by checking the values of my old column against a list. I'm new to PySpark and don't understand my error message. Here's what I've tried:
from pyspark.sql import functions as F
my_list = ['375', '012', '013','014','015','016']
expr = F.when(F.col("old_col").isin(my_list),F.lit(20)).otherwise(None).alias("new_col")
new_df = old_df.select("*",*expr)
My error message: TypeError: Column is not iterable

Get rid of the * in *expr - expr is a column and should not be iterated/unpacked.
new_df = old_df.select("*",expr)
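Put together, with the question's old_df this should produce something like:
from pyspark.sql import functions as F

my_list = ['375', '012', '013', '014', '015', '016']
expr = F.when(F.col("old_col").isin(my_list), F.lit(20)).otherwise(None).alias("new_col")
new_df = old_df.select("*", expr)
new_df.show()
# +-------+-------------------+-------+
# |old_col|example_new_col_val|new_col|
# +-------+-------------------+-------+
# |    375|                 20|     20|
# |    265|                 20|   null|
# |    052|                 20|   null|
# |    111|               null|   null|
# +-------+-------------------+-------+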

When defining my_list, make sure it is a plain Python list, not a Column. If the values actually come from another DataFrame, convert that column to a list first, for example:
my_list = list(other_df.select('old_col').toPandas()['old_col'])
(other_df here stands in for whatever DataFrame holds the values.) The rest of your code remains the same.

You need to use the withColumn() function here in order to create a new column on your existing DataFrame:
df = df.withColumn("new_col", F.when(F.col("old_col").isin(my_list), F.lit(20)).otherwise(F.lit(None)))

Related

PySpark Dynamic When Statement

I have a list of strings I am using to create column names. This list is dynamic and may change over time. Depending on the value of the string the column name changes. An example of the code I currently have is below:
df = df.withColumn("newCol",
    F.when(df.pet == "dog", df.dog_Column)
     .otherwise(F.when(df.pet == "cat", df.cat_Column)
                 .otherwise(None)))
I want to return the column that is a derivation of the name in the list. I would like to do something like this instead:
dfvalues = ["dog", "cat", "parrot", "goldfish"]
df = df.withColumn("newCol",
    F.when(df.pet == dfvalues[0], F.col(dfvalues[0] + "_Column")))
The issue is that I cannot figure out how to make a looping condition in PySpark.
One way could be to use a list comprehension in conjunction with coalesce(), very similar to the answer here.
mycols = [F.when(F.col("pet") == p, F.col(p + "_Column")) for p in dfvalues]
df = df.select("*", F.coalesce(*mycols).alias("newCol"))
This works because when() returns null for unmatched rows when there is no otherwise(), and coalesce() picks the first non-null column.
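For reference, a minimal check of this approach (a sketch assuming a sqlContext as in the first question, with toy column names mirroring the question's setup):
from pyspark.sql import functions as F

df = sqlContext.createDataFrame(
    [("dog1", "cat1", "dog"), ("dog2", "cat2", "cat")],
    ["dog_Column", "cat_Column", "pet"])
dfvalues = ["dog", "cat"]
mycols = [F.when(F.col("pet") == p, F.col(p + "_Column")) for p in dfvalues]
df.select("*", F.coalesce(*mycols).alias("newCol")).show()
# +----------+----------+---+------+
# |dog_Column|cat_Column|pet|newCol|
# +----------+----------+---+------+
# |      dog1|      cat1|dog|  dog1|
# |      dog2|      cat2|cat|  cat2|
# +----------+----------+---+------+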
I faced the same problem and found a clean solution: you can use Python's functools.reduce to do the looping.
from functools import reduce
from pyspark.sql import functions as F

def update_col(df1, val):
    # note: operate on the accumulator df1, not the outer df
    return df1.withColumn('newCol',
                          F.when(F.col('pet') == val, F.col(val + '_column'))
                           .otherwise(F.col('newCol')))

# start from a placeholder column
df1 = df.withColumn('newCol', F.lit(0))
reduce(update_col, dfvalues, df1).show()
With dfvalues = ["dog", "cat"] and a sample DataFrame containing dog_column, cat_column, and pet columns, that yields:
+----------+----------+---+------+
|cat_column|dog_column|pet|newCol|
+----------+----------+---+------+
| cat1| dog1|dog| dog1|
| cat2| dog2|cat| cat2|
+----------+----------+---+------+

Average of median of a column in a list of dataframes

I am looking for the best way to take the average of the medians of a column (same column name) across a list of data frames.
Let's say I have a list of dataframes, list_df. I can write the following for loop to get the required output, but I am more interested in whether we can eliminate the for loop:
med_arr = []
list_df = [df1, df2, df3]
for df in list_df:
    med_arr.append(np.median(df['col_name']))
np.mean(med_arr)
Consider the sample data
import numpy as np
import pandas as pd

np.random.seed([3,1415])
df1 = pd.DataFrame(dict(col_name=np.random.randint(10, size=10)))
df2 = pd.DataFrame(dict(col_name=np.random.randint(10, size=10)))
df3 = pd.DataFrame(dict(col_name=np.random.randint(10, size=10)))
list_df = [df1, df2, df3]
Option 1
pandas
pd.concat([d['col_name'] for d in list_df], axis=1).median().mean()
3.8333333333333335
Option 2
numpy
np.median([d['col_name'].values for d in list_df], 1).mean()
3.8333333333333335
This could be done as a list comprehension:
list_df = [ df1, df2, df3 ]
med_arr = [ np.median( df['col_name'] ) for df in list_df ]
np.mean(med_arr)

python: Finding min values of subsets of a list

I have a list that looks something like this
(The columns would essentially be acct, subacct, value.):
1,1,3
1,2,-4
1,3,1
2,1,1
3,1,2
3,2,4
4,1,1
4,2,-1
I want to update the list to look like this:
(The columns are now acct, subacct, value, min of the value for each account)
1,1,3,-4
1,2,-4,-4
1,3,1,-4
2,1,1,1
3,1,2,2
3,2,4,2
4,1,1,-1
4,2,-1,-1
The fourth value is derived by taking the min(value) for each account. So, for account 1, the min is -4, so col4 would be -4 for the three records tied to account 1.
For account 2, there is only one value.
For account 3, the min of 2 and 4 is 2, so the value for col 4 is 2 where account = 3.
I need to preserve col3, as I will need to use the value in column 3 for other calculations later. I also need to create this additional column for output later.
I have tried the following:
with open(file_name, 'rU') as f:  # opens PW file
    data = zip(*csv.reader(f, delimiter='\t'))
    # data = list(list(rec) for rec in csv.reader(f, delimiter='\t'))
    # reads csv into a list of lists

# print the first row
uniqAcct = []
data[0] not in used and (uniqAcct.append(data[0]) or True)
But short of looping through and matching on each unique count and then going back through and adding a new column, I am stuck. I think there must be a pythonic way of doing this, but I cannot figure it out. Any help would be greatly appreciated!
I cannot use numpy, pandas, etc as they cannot be installed on this server yet. I need to use just basic python2
So the problem here is your data structure; it's not trivial to index.
Ideally you'd change it to something readable and keep it in those containers. However, if you insist on changing it back into tuples, I'd go with this construction:
# dummy values
data = [
    (1, 1, 3),
    (1, 2, -4),
    (1, 3, 1),
    (2, 1, 1),
    (3, 1, 2),
    (3, 2, 4),
    (4, 1, 1),
    (4, 2, -1),
]

class Account:
    def __init__(self, acct):
        self.acct = acct
        self.subaccts = {}  # maps sub-account id to its value

    def as_tuples(self):
        min_value = min(self.subaccts.values())
        for subacct, val in self.subaccts.items():
            yield (self.acct, subacct, val, min_value)

def accounts_as_tuples(accounts):
    return [summary for acct_obj in accounts.values()
            for summary in acct_obj.as_tuples()]

accounts = {}
for acct, subacct, val in data:
    if acct not in accounts:
        accounts[acct] = Account(acct)
    accounts[acct].subaccts[subacct] = val

print(accounts_as_tuples(accounts))
But ideally, I'd keep it in the Account objects and just add a method that extracts the minimal value of the account when it's needed.
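For instance, a sketch of such a method (the name min_value is mine, not from the original code):
    def min_value(self):
        # smallest value across this account's sub-accounts, computed on demand
        return min(self.subaccts.values())

# usage, given the accounts dict built above:
# accounts[1].min_value()  ->  -4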
Here is another way, using your initial approach.
Modify the way you import your data, so you can easily handle it in Python:
import csv

mylist = []
with open(file_name, 'rU') as f:  # opens PW file
    data = csv.reader(f, delimiter='\t')
    for row in data:
        splitted = row[0].split(',')
        # this is in case you need integers
        splitted = [int(i) for i in splitted]
        mylist += [splitted]
Then, add the fourth column:
updated = []
for acc in set(zip(*mylist)[0]):
    acclist = [x for x in mylist if x[0] == acc]
    m = min(x[2] for x in acclist)  # min over the value column only
    for l in acclist:
        l.append(m)
    updated += acclist
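With the question's sample data, updated should then match the desired output (group order may vary, since set() iteration order is arbitrary):
for row in updated:
    print(row)
# [1, 1, 3, -4]
# [1, 2, -4, -4]
# [1, 3, 1, -4]
# [2, 1, 1, 1]
# [3, 1, 2, 2]
# [3, 2, 4, 2]
# [4, 1, 1, -1]
# [4, 2, -1, -1]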

Pandas append list to list of column names

I'm looking for a way to append a list of column names to existing column names in a DataFrame in pandas and then reorder them by col_start + col_add.
The DataFrame already contains the columns from col_start.
Something like:
import pandas as pd
df = pd.read_csv("file.csv")
col_start = ["col_a", "col_b", "col_c"]
col_add = ["Col_d", "Col_e", "Col_f"]
df = pd.concat([df,pd.DataFrame(columns = list(col_add))]) #Add columns
df = df[[col_start.extend(col_add)]] #Rearrange columns
Also, is there a way to capitalize the first letter for each item in col_start, analogous to title() or capitalize()?
Your code is nearly there; a couple of things:
df = pd.concat([df,pd.DataFrame(columns = list(col_add))])
can be simplified to just this as col_add is already a list:
df = pd.concat([df,pd.DataFrame(columns = col_add)])
Also, you can just add two lists together, so:
df = df[[col_start.extend(col_add)]]
becomes
df = df[col_start+col_add]
And to capitalise the first letter in your list just do:
In [184]:
col_start = ["col_a", "col_b", "col_c"]
col_start = [x.title() for x in col_start]
col_start
Out[184]:
['Col_A', 'Col_B', 'Col_C']
EDIT
To avoid the KeyError on the capitalised column names, you need to capitalise after calling concat; the columns have a vectorised str.title method:
In [187]:
df = pd.DataFrame(columns = col_start + col_add)
df
Out[187]:
Empty DataFrame
Columns: [col_a, col_b, col_c, Col_d, Col_e, Col_f]
Index: []
In [188]:
df.columns = df.columns.str.title()
df.columns
Out[188]:
Index(['Col_A', 'Col_B', 'Col_C', 'Col_D', 'Col_E', 'Col_F'], dtype='object')
Here's what you want to do:
import pandas as pd

# here you have a first dataframe
d1 = pd.DataFrame([[1,2,3],[4,5,6]], columns=['col1','col2','col3'])
# a second one
d2 = pd.DataFrame([[8,7,3,8],[4,8,6,8]], columns=['col4','col5','col6','col7'])
# here we can make a dataframe with d1 and d2
d = pd.concat((d1, d2), axis=1)
# we want a different order for the columns? (col_start and col_add are
# redefined here to match this example's column names)
col_start = ['col1', 'col2', 'col3']
col_add = ['col4', 'col5', 'col6', 'col7']
d = d[col_start + col_add]
If you want to capitalize values from a column 'col', you can do
d['col'] = d['col'].str.capitalize()
PS: Update Pandas if ".str.capitalize()" doesn't work.
Or, what you can do:
df['col'] = df['col'].map(lambda x: x.capitalize())

Append new column by subtracting existing columns in a csv using Python

I tried to append a new column to an existing csv file using Python. It shows no error, but the column is not created.
I have a CSV file and I want to add a new column computed by subtracting one existing column from another.
ID,SURFACES,A1X,A1Y,A1Z,A2X
1,GROUND,800085.3323,961271.977,-3.07E-18,800080.8795
Add the column AX (= A1X - A2X).
CODE:
x = csv.reader(open('E:/solarpotential analysis/iitborientation/trialcsv.csv', 'rb'))
y = csv.writer(open('E:/solarpotential analysis/iitborientation/trial.csv', 'wb', buffering=0))
for row in x:
    a = float(row[0])
    b = str(row[1])
    c = float(row[2])
    d = float(row[3])
    e = float(row[4])
    f = float(row[2] - row[5])
    y.writerow([a, b, c, d, e, f])
It shows no error, but the output file is not updated.
You can do it this way:
inputt = open("input.csv", "r")
outputt = open("output.csv", "w")
for line in inputt.readlines():
    # append a new field to every line (";6column" is just a placeholder)
    outputt.write(line.replace("\n", "") + ";6column\n")
inputt.close()
outputt.close()
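If you want the new column to actually hold AX = A1X - A2X, here is a sketch that stays close to the csv-based approach from the question (file paths are placeholders):
import csv

with open('input.csv', 'r') as fin, open('output.csv', 'w') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout, lineterminator='\n')
    header = next(reader)
    writer.writerow(header + ['AX'])  # write the new column name
    for row in reader:
        ax = float(row[2]) - float(row[5])  # AX = A1X - A2X
        writer.writerow(row + [ax])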