PySpark Dynamic When Statement - python-2.7

I have a list of strings I am using to create column names. This list is dynamic and may change over time. Depending on the value of the string the column name changes. An example of the code I currently have is below:
df = df.withColumn("newCol", \
F.when(df.pet == "dog", df.dog_Column) \
.otherwise(F.when(df.pet == "cat", df.cat_Column) \
.otherwise(None))))
I want to return the column that is a derivation of the name in the list. I would like to do something like this instead:
dfvalues = ["dog", "cat", "parrot", "goldfish"]
df = df.withColumn("newCol", F.when(df.pet == dfvalues[0], \
F.col(dfvalues[0] + "_Column"))
The issue is that I cannot figure out how to make a looping condition in Pyspark.

One way could be to use a list comprehension in conjunction with coalesce, very similar to the answer here.
mycols = [F.when(F.col("pet") == p, F.col(p + "_Column")) for p in dfvalues]
df = df.select("*", F.coalesce(*mycols).alias("newCol"))
This works because when() returns null for rows where the condition is not met and there is no otherwise(), and coalesce() picks the first non-null column.
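For instance, here is a minimal sketch of that behaviour on a toy DataFrame (the sample data and the active SparkSession named spark are assumptions, not part of the original question):
from pyspark.sql import functions as F
toy = spark.createDataFrame(
    [("dog", "dog1", "cat1"), ("cat", "dog2", "cat2"), ("parrot", "dog3", "cat3")],
    ["pet", "dog_Column", "cat_Column"])
mycols = [F.when(F.col("pet") == p, F.col(p + "_Column")) for p in ["dog", "cat"]]
toy.select("*", F.coalesce(*mycols).alias("newCol")).show()
# the parrot row ends up with null in newCol, because none of the when() conditions matched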

I faced the same problem and found this site link. You can use Python's reduce to do the looping for a clean solution.
from functools import reduce

def update_col(df1, val):
    return df1.withColumn('newCol',
                          F.when(F.col('pet') == val, F.col(val + '_column'))
                           .otherwise(F.col('newCol')))

# add a placeholder column to start from
df1 = df.withColumn('newCol', F.lit(0))
reduce(update_col, dfvalues, df1).show()
Given
from pyspark.sql import functions as F
dfvalues = ["dog", "cat"]
and a sample df that has cat_column, dog_column and pet columns, the reduce(...).show() call above yields:
+----------+----------+---+------+
|cat_column|dog_column|pet|newCol|
+----------+----------+---+------+
| cat1| dog1|dog| dog1|
| cat2| dog2|cat| cat2|
+----------+----------+---+------+

Related

Create multiple lists and store them into a dictionary Python

Here's my situation. I have two lists:
A list which comes from a DataFrame column (OS_name)
A list with the unique values of that column (OS_values)
OS_name = df['OS_name'].tolist()
OS_values = df.OS_name.unique().tolist()
I want to create several lists (one per value in OS_values) like this:
t = []
for i in range(len(OS_name)):
    if OS_values[0] == OS_name[i]:
        t.append(1)
    else:
        t.append(0)
I want to create a list for each value in OS_values, store them in a dictionary, and finally create a df from that dictionary.
If the value can be used as the dictionary key, that would be great, but it's not necessary.
I read that defaultdict may be helpful, but I cannot find a way to use it.
Thanks for the help and have a great day!
I got it working in the end.
dict_stable_feature = dict()
for col in df_stable:
    t1 = df[col].tolist()
    t2 = df[col].unique().tolist()
    for value in t2:
        t = []
        for i in range(len(t1)):
            if value == t1[i]:
                t.append(1)
            else:
                t.append(0)
        cc = str(col)
        vv = "_" + str(value)
        cv = cc + vv
        dict_stable_feature[cv] = t
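For the final step the question mentions (turning the dictionary into a DataFrame), here is a minimal sketch, assuming pandas is imported as pd and every list in the dictionary has the same length (which it does, since each one holds one flag per row):
import pandas as pd
# each key of dict_stable_feature becomes a column of 0/1 flags
encoded_df = pd.DataFrame(dict_stable_feature)
As a side note, pd.get_dummies(df[list(df_stable)]) would produce a very similar 0/1 encoding in a single call, if the exact column naming is not important.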

Counting matrix pairs using a threshold

I have a folder with hundreds of txt files I need to analyse for similarity. Below is an example of a script I use to run similarity analysis. In the end I get an array or a matrix I can plot etc.
I would like to see how many pairs there are with cos_similarity > 0.5 (or any other threshold I decide to use), removing cos_similarity == 1 when I compare the same files, of course.
Secondly, I need a list of these pairs based on file names.
So the output for the example below would look like:
1
and
["doc1", "doc4"]
I will really appreciate your help, as I feel a bit lost and don't know which direction to go.
This is an example of my script to get the matrix:
doc1 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints that it is failing to meet that pledge."
doc2 = "The BBC has been inundated with comments from Amazon Prime customers. Most reported problems with deliveries."
doc3 = "An Amazon spokesman told the BBC the ASA had confirmed to it there was no investigation at this time."
doc4 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints..."
documents = [doc1, doc2, doc3, doc4]
# In my real script I iterate through a folder (path) with txt files like this:
#def read_text(path):
#    documents = []
#    for filename in glob.iglob(path+'*.txt'):
#        _file = open(filename, 'r')
#        text = _file.read()
#        documents.append(text)
#    return documents
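# A hedged, runnable version of the helper sketched in the comments above
# (the path argument and the '*.txt' pattern come from that sketch and are assumptions):
import glob
def read_text(path):
    documents = []
    for filename in glob.iglob(path + '*.txt'):
        with open(filename, 'r') as _file:
            documents.append(_file.read())
    return documents
# e.g. documents = read_text('/path/to/folder/')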
import nltk, string, numpy
nltk.download('punkt') # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
nltk.download('wordnet') # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
tf_matrix = LemVectorizer.transform(documents).toarray()
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
tfidf_matrix = tfidfTran.transform(tf_matrix)
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()
cos_similarity(documents)
Out:
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
As I understood your question, you want a function that takes the output numpy array and a certain value (threshold) and returns two things:
how many document pairs have a similarity greater than or equal to the given threshold
the names of the documents in those pairs.
So, I've made the following function, which takes three arguments:
the output numpy array from the cos_similarity() function.
a list of document names.
a certain number (threshold).
And here it is:
def get_docs(arr, docs_names, threshold):
    output_tuples = []
    for row in range(len(arr)):
        lst = [row + 1 + idx for idx, num in
               enumerate(arr[row, row+1:]) if num >= threshold]
        for item in lst:
            output_tuples.append((docs_names[row], docs_names[item]))
    return len(output_tuples), output_tuples
Let's see it in action:
>>> docs_names = ["doc1", "doc2", "doc3", "doc4"]
>>> arr = cos_similarity(documents)
>>> arr
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
>>> threshold = 0.5
>>> get_docs(arr, docs_names, threshold)
(1, [('doc1', 'doc4')])
>>> get_docs(arr, docs_names, 1)
(0, [])
>>> get_docs(arr, docs_names, 0.13)
(3, [('doc1', 'doc2'), ('doc1', 'doc4'), ('doc2', 'doc4')])
Let's see how this function works:
First, I iterate over every row of the numpy array.
Second, I iterate over every item in the row whose index is bigger than the row's index, i.e. over the upper triangle of the matrix.
That's because each pair of documents appears twice in the full array; the two values arr[0][1] and arr[1][0] are the same, for example. The diagonal items aren't included either, because we know for sure they are 1, as every document is perfectly similar to itself :).
Finally, we collect the items whose values are greater than or equal to the given threshold and return their indices. These indices are later used to look up the document names.
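As an extra sketch (not part of the original answer), the same upper-triangle scan can be expressed with numpy's triu_indices, which avoids the explicit Python loops:
import numpy as np

def get_docs_np(arr, docs_names, threshold):
    arr = np.asarray(arr)
    # indices of the strict upper triangle, i.e. excluding the diagonal of 1s
    rows, cols = np.triu_indices(len(arr), k=1)
    mask = arr[rows, cols] >= threshold
    pairs = [(docs_names[r], docs_names[c]) for r, c in zip(rows[mask], cols[mask])]
    return len(pairs), pairs
It returns the same (count, pairs) tuple as get_docs above.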

How to pass array(multiple column) in below code using pyspark

How can I pass a list of columns (multiple columns) instead of a single column in PySpark, using this command:
new_df = new_df.filter(new_df.color.isin(*filter_list) == False)
eg:-
I used this code to remove garbage values (#, $) from a single column:
filter_list = ['##', '$']
new_df = new_df.filter(new_df.color.isin(*filter_list) == False)
In this example, 'color' is the column.
But I want to remove garbage values (#, ##, $, $$$), with multiple occurrences, from multiple columns.
Sample Input:
id     name     Salary
#      Yogita   3000
2      Bhavana  5000
$$     ###      7000
%$4#   Neha     $$$$
Sample Output:
id     name     Salary
2      Bhavana  5000
Can anybody help me?
Thanks in advance,
Yogita
Here is an answer using a user-defined function:
from functools import reduce
from itertools import chain
from pyspark.sql import functions as f
from pyspark.sql.types import BooleanType

filter_list = ['#', '##', '$', '$$$']

def filterfn(*x):
    # True only if none of the garbage strings occurs in any of the row's (string) values
    booleans = list(chain(*[[garbage not in elt for garbage in filter_list] for elt in x]))
    return reduce(lambda a, b: a and b, booleans, True)

filter_udf = f.udf(filterfn, BooleanType())
new_df.filter(filter_udf(*[col for col in new_df.columns])).show(10)
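As a hedged alternative sketch, the same kind of row filter can be built from plain column expressions without a UDF; note that isin does exact matching, unlike the substring test inside filterfn above:
from functools import reduce
from pyspark.sql import functions as F

filter_list = ['#', '##', '$', '$$$']
# keep a row only if none of its columns holds one of the garbage values
cond = reduce(lambda acc, c: acc & ~F.col(c).isin(filter_list),
              new_df.columns, F.lit(True))
new_df.filter(cond).show(10)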

How to sort python lists due to certain criteria

I would like to sort a list or an array using Python to achieve the following:
Say my initial list is:
example_list = ["retg_1_gertg","fsvs_1_vs","vrtv_2_srtv","srtv_2_bzt","wft_3_btb","tvsrt_3_rtbbrz"]
I would like to get all the elements that have 1 after the first underscore together in one list, the ones that have 2 together in another list, and so on. So the result should be:
sorted_list = [["retg_1_gertg","fsvs_1_vs"],["vrtv_2_srtv","srtv_2_bzt"],["wft_3_btb","tvsrt_3_rtbbrz"]]
My code:
import numpy as np
import string
example_list = ["retg_1_gertg","fsvs_1_vs","vrtv_2_srtv","srtv_2_bzt","wft_3_btb","tvsrt_3_rtbbrz"]

def sort_list(imagelist):
    # get number of wafers
    waferlist = []
    for image in imagelist:
        wafer_id = string.split(image, "_")[1]
        waferlist.append(wafer_id)
    waferlist = set(waferlist)
    waferlist = list(waferlist)
    number_of_wafers = len(waferlist)
    # create list
    sorted_list = []
    for i in range(number_of_wafers):
        sorted_list.append([])
    for i in range(number_of_wafers):
        wafer_id = waferlist[i]
        for image in imagelist:
            if string.split(image, "_")[1] == wafer_id:
                sorted_list[i].append(image)
    return sorted_list

sorted_list = sort_list(example_list)
This works, but it is really awkward and involves many for loops that slow everything down when the lists are large.
Is there any more elegant way using numpy or anything?
Help is appreciated. Thanks.
I'm not sure how much more elegant this solution is; it is a bit more efficient. You could first sort the list and then go through and filter into final set of sorted lists:
example_list = ["retg_1_gertg","fsvs_1_vs","vrtv_2_srtv","srtv_2_bzt","wft_3_btb","tvsrt_3_rtbbrz"]
sorted_list = sorted(example_list, key=lambda x: x[x.index('_')+1])
result = [[]]
current_num = sorted_list[0][sorted_list[0].index('_')+1]
index = 0
for i in example_list:
if current_num != i[i.index('_')+1]:
current_num = i[i.index('_')+1]
index += 1
result.append([])
result[index].append(i)
print result
If you can make assumptions about the values after the first underscore character, you could clean it up a bit (for example, if you knew that they would always be sequential numbers starting at 1).
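For reference, here is a more compact sketch (the helper name is just illustrative) that groups on the whole token between the first and second underscore using itertools.groupby, rather than on a single character:
from itertools import groupby

def group_by_middle_token(items):
    # group by the token between the first and second underscore
    key = lambda name: name.split("_")[1]
    return [list(group) for _, group in groupby(sorted(items, key=key), key=key)]

sorted_list = group_by_middle_token(example_list)
# [['retg_1_gertg', 'fsvs_1_vs'], ['vrtv_2_srtv', 'srtv_2_bzt'], ['wft_3_btb', 'tvsrt_3_rtbbrz']]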

Pandas append list to list of column names

I'm looking for a way to append a list of column names to the existing column names in a pandas DataFrame and then reorder them as col_start + col_add.
The DataFrame already contains the columns from col_start.
Something like:
import pandas as pd
df = pd.read_csv("file.csv")
col_start = ["col_a", "col_b", "col_c"]
col_add = ["Col_d", "Col_e", "Col_f"]
df = pd.concat([df,pd.DataFrame(columns = list(col_add))]) #Add columns
df = df[[col_start.extend(col_add)]] #Rearrange columns
Also, is there a way to capitalize the first letter for each item in col_start, analogous to title() or capitalize()?
Your code is nearly there; a couple of things:
df = pd.concat([df,pd.DataFrame(columns = list(col_add))])
can be simplified to just this as col_add is already a list:
df = pd.concat([df,pd.DataFrame(columns = col_add)])
Also you can also just add 2 lists together so:
df = df[[col_start.extend(col_add)]]
becomes
df = df[col_start+col_add]
And to capitalise the first letter in your list just do:
In [184]:
col_start = ["col_a", "col_b", "col_c"]
col_start = [x.title() for x in col_start]
col_start
Out[184]:
['Col_A', 'Col_B', 'Col_C']
EDIT
To avoid the KeyError on the capitalised column names, you need to capitalise after calling concat; the columns have a vectorised .str.title method:
In [187]:
df = pd.DataFrame(columns = col_start + col_add)
df
Out[187]:
Empty DataFrame
Columns: [col_a, col_b, col_c, Col_d, Col_e, Col_f]
Index: []
In [188]:
df.columns = df.columns.str.title()
df.columns
Out[188]:
Index(['Col_A', 'Col_B', 'Col_C', 'Col_D', 'Col_E', 'Col_F'], dtype='object')
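As one more hedged alternative, DataFrame.reindex can add the missing columns (filled with NaN) and put everything into the desired order in a single step:
df = df.reindex(columns=col_start + col_add)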
Here is what you want to do:
import pandas as pd
# Here you have a first dataframe
d1 = pd.DataFrame([[1,2,3],[4,5,6]], columns=['col1','col2','col3'])
# and a second one
d2 = pd.DataFrame([[8,7,3,8],[4,8,6,8]], columns=['col4','col5','col6','col7'])
# Here we can make a single dataframe out of d1 and d2
d = pd.concat((d1, d2), axis=1)
# Want a different column order? (col_start + col_add must list d's column names)
d = d[col_start + col_add]
If you want to capitalize values from a column 'col', you can do
d['col'] = d['col'].str.capitalize()
PS: Update Pandas if ".str.capitalize()" doesn't work.
Or, what you can do:
df['col'] = df['col'].map(lambda x:x.capitalize())