How to pass an array (multiple columns) in the code below using PySpark

How can I pass a list of columns (multiple columns) instead of a single column in PySpark using this command:
new_df = new_df.filter(new_df.color.isin(*filter_list) == False)
For example, I used this code to remove garbage values (#, $) from a single column:
filter_list = ['##', '$']
new_df = new_df.filter(new_df.color.isin(*filter_list) == False)
In this example 'color' is the column.
But I want to remove garbage values (#, ##, $, $$$), with multiple occurrences, across multiple columns.
Sample Input:
id     name     salary
#      Yogita   3000
2      Bhavana  5000
$$     ###      7000
%$4#   Neha     $$$$
Sample Output:
id     name     salary
2      Bhavana  5000
Can anybody help me?
Thanks in advance,
Yogita

Here is an answer using a user-defined function:
from functools import reduce
from itertools import chain
from pyspark.sql import functions as f
from pyspark.sql.types import BooleanType

filter_list = ['#', '##', '$', '$$$']

def filterfn(*x):
    # For every column value, check it against every entry in filter_list, then
    # AND all the checks together: a row is kept only if no column contains any
    # of the garbage values. Note this assumes every column value is a string,
    # as in the sample data.
    booleans = list(chain(*[[filt not in elt for filt in filter_list] for elt in x]))
    return reduce(lambda a, b: a and b, booleans, True)

filter_udf = f.udf(filterfn, BooleanType())
new_df.filter(filter_udf(*[col for col in new_df.columns])).show(10)
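A UDF is not strictly required here. A minimal sketch of an alternative (assuming exact-match semantics and that new_df and filter_list are defined as above) builds the same kind of row filter from built-in column expressions:
from functools import reduce
from pyspark.sql import functions as f

filter_list = ['#', '##', '$', '$$$']

# Keep a row only if none of its columns is an exact match for a garbage value.
# isin builds one condition per column and the conditions are ANDed together,
# so no Python UDF is needed; exact-match (rather than substring) semantics
# are an assumption here.
conditions = [~f.col(c).isin(filter_list) for c in new_df.columns]
clean_df = new_df.filter(reduce(lambda a, b: a & b, conditions))
clean_df.show(10)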

Related

Pyspark: how to loop through a dataframe column which contains list elements?

I have 2 dataframes (all_posts and headliners). How do I loop through the all_posts['tagged_persons'] column to see if an element of the list AND the corresponding year equal a row of the headliners dataframe?
You can explode the tagged_persons column first. After that, join it with headliners on the tagged_persons/Artist and year columns, and then filtering the rows where Artist is not null will give you the resultant data.
from pyspark.sql.functions import col, explode
# Explode the list column so each tagged person gets its own row, keeping year for the join.
all_posts = all_posts.select(all_posts.year, explode(all_posts.tagged_persons).alias("tagged_persons"))
cond = [all_posts.tagged_persons == headliners.Artist, all_posts.year == headliners.Year]
join_df = all_posts.join(headliners, cond, 'left')
# After the left join, only rows with a matching headliner have a non-null Artist.
filter_df = join_df.filter(col("Artist").isNotNull())
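For a quick sanity check, here is a minimal self-contained sketch with made-up data; the column names (year, tagged_persons, Artist, Year) are assumptions based on the question:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data, just to illustrate the explode + join + filter pattern.
all_posts = spark.createDataFrame(
    [(2019, ["Artist A", "Artist B"]), (2020, ["Artist C"])],
    ["year", "tagged_persons"],
)
headliners = spark.createDataFrame(
    [("Artist A", 2019), ("Artist D", 2020)],
    ["Artist", "Year"],
)

exploded = all_posts.select("year", explode("tagged_persons").alias("tagged_persons"))
cond = [exploded.tagged_persons == headliners.Artist, exploded.year == headliners.Year]
matches = exploded.join(headliners, cond, "left").filter(col("Artist").isNotNull())
matches.show()  # expect only the ("Artist A", 2019) match to remain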

PySpark Dynamic When Statement

I have a list of strings I am using to create column names. This list is dynamic and may change over time. Depending on the value of the string the column name changes. An example of the code I currently have is below:
df = df.withColumn("newCol", \
    F.when(df.pet == "dog", df.dog_Column) \
     .otherwise(F.when(df.pet == "cat", df.cat_Column) \
     .otherwise(None)))
I want to return the column that is a derivation of the name in the list. I would like to do something like this instead:
dfvalues = ["dog", "cat", "parrot", "goldfish"]
df = df.withColumn("newCol", F.when(df.pet == dfvalues[0], \
                                    F.col(dfvalues[0] + "_Column")))
The issue is that I cannot figure out how to make a looping condition in Pyspark.
One way could be to use a list comprehension in conjunction with coalesce, very similar to the answer here.
mycols = [F.when(F.col("pet") == p, F.col(p + "_Column")) for p in dfvalues]
df = df.select("*", F.coalesce(*mycols).alias("newCol"))
This works because when() returns null for rows where the condition does not match and there is no otherwise(), and coalesce() picks the first non-null column.
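To make the mechanics concrete, here is a minimal sketch (assuming the pet, dog_Column and cat_Column columns exist, as in the question) of what the comprehension expands to for the first two entries:
from pyspark.sql import functions as F

# For dfvalues = ["dog", "cat"], mycols is equivalent to this explicit list:
mycols = [
    F.when(F.col("pet") == "dog", F.col("dog_Column")),   # null unless pet == "dog"
    F.when(F.col("pet") == "cat", F.col("cat_Column")),   # null unless pet == "cat"
]
# coalesce walks the list left to right and keeps the first non-null value per row.
df = df.select("*", F.coalesce(*mycols).alias("newCol"))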
I faced the same problem and found a clean solution: you can use Python's reduce to do the looping.
from functools import reduce
from pyspark.sql import functions as F

def update_col(df1, val):
    # Fill newCol where pet matches val; otherwise keep whatever is already there.
    return df1.withColumn('newCol',
                          F.when(F.col('pet') == val, F.col(val + '_column'))
                           .otherwise(F.col('newCol')))

# Start with an empty (null) string column, then fold every value in dfvalues into it.
df1 = df.withColumn('newCol', F.lit(None).cast('string'))
reduce(update_col, dfvalues, df1).show()
Run against a small demo DataFrame, that yields:
dfvalues = ["dog", "cat"]
df = spark.createDataFrame([("cat1", "dog1", "dog"), ("cat2", "dog2", "cat")],
                           ["cat_column", "dog_column", "pet"])
df1 = df.withColumn("newCol", F.lit(None).cast("string"))
reduce(update_col, dfvalues, df1).show()
+----------+----------+---+------+
|cat_column|dog_column|pet|newCol|
+----------+----------+---+------+
| cat1| dog1|dog| dog1|
| cat2| dog2|cat| cat2|
+----------+----------+---+------+

Counting matrix pairs using a threshold

I have a folder with hundreds of txt files I need to analyse for similarity. Below is an example of a script I use to run similarity analysis. In the end I get an array or a matrix I can plot etc.
I would like to see how many pairs there are with cos_similarity > 0.5 (or any other threshold I decide to use), removing cos_similarity == 1 when I compare the same files, of course.
Secondly, I need a list of these pairs based on file names.
So the output for the example below would look like:
1
and
["doc1", "doc4"]
Will really appreciate your help as I feel a bit lost not knowing which direction to go.
This is an example of my script to get the matrix:
doc1 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints that it is failing to meet that pledge."
doc2 = "The BBC has been inundated with comments from Amazon Prime customers. Most reported problems with deliveries."
doc3 = "An Amazon spokesman told the BBC the ASA had confirmed to it there was no investigation at this time."
doc4 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints..."
documents = [doc1, doc2, doc3, doc4]
# In my real script I iterate through a folder (path) with txt files like this:
# def read_text(path):
#     documents = []
#     for filename in glob.iglob(path + '*.txt'):
#         _file = open(filename, 'r')
#         text = _file.read()
#         documents.append(text)
#     return documents
import nltk, string, numpy
nltk.download('punkt') # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
nltk.download('wordnet') # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
tf_matrix = LemVectorizer.transform(documents).toarray()
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
tfidf_matrix = tfidfTran.transform(tf_matrix)
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()
cos_similarity(documents)
Out:
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
As I understood your question, you want to create a function that reads the output numpy array and a certain value (threshold) in order to return two things:
how many document pairs score greater than or equal to the given threshold
the names of these documents.
So, here I've made the following function, which takes three arguments:
the output numpy array from the cos_similarity() function.
a list of document names.
a certain number (threshold).
And here it is:
def get_docs(arr, docs_names, threshold):
    output_tuples = []
    for row in range(len(arr)):
        lst = [row+1+idx for idx, num in
               enumerate(arr[row, row+1:]) if num >= threshold]
        for item in lst:
            output_tuples.append((docs_names[row], docs_names[item]))
    return len(output_tuples), output_tuples
Let's see it in action:
>>> docs_names = ["doc1", "doc2", "doc3", "doc4"]
>>> arr = cos_similarity(documents)
>>> arr
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
>>> threshold = 0.5
>>> get_docs(arr, docs_names, threshold)
(1, [('doc1', 'doc4')])
>>> get_docs(arr, docs_names, 1)
(0, [])
>>> get_docs(arr, docs_names, 0.13)
(3, [('doc1', 'doc2'), ('doc1', 'doc4'), ('doc2', 'doc4')])
Let's see how this function works:
First, I iterate over every row of the numpy array.
Second, I iterate over every item in the row whose index is greater than the row's index, so we only walk the upper triangle of the matrix.
That's because each pair of documents appears twice in the full array; for example, arr[0][1] and arr[1][0] hold the same value. The diagonal items aren't included either, because we know for sure they are 1, as every document is perfectly similar to itself.
Finally, we collect the items whose values are greater than or equal to the given threshold and return their indices. These indices are used later to get the document names.
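For larger matrices, a vectorized sketch of the same idea with NumPy (letting np.triu_indices enumerate the upper triangle for us) could look like this:
import numpy as np

def get_docs_vectorized(arr, docs_names, threshold):
    # Indices of the strict upper triangle (k=1 skips the diagonal of 1.0 self-similarities).
    rows, cols = np.triu_indices(len(arr), k=1)
    keep = arr[rows, cols] >= threshold
    pairs = [(docs_names[r], docs_names[c]) for r, c in zip(rows[keep], cols[keep])]
    return len(pairs), pairs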

Python: create a pandas data frame from a list

I am using the following code to create a data frame from a list:
test_list = ['a','b','c','d']
df_test = pd.DataFrame.from_records(test_list, columns=['my_letters'])
df_test
The above code works fine. Then I tried the same approach for another list:
import pandas as pd
q_list = ['112354401', '116115526', '114909312', '122425491', '131957025', '111373473']
df1 = pd.DataFrame.from_records(q_list, columns=['q_data'])
df1
But it gave me the following errors this time:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-24-99e7b8e32a52> in <module>()
1 import pandas as pd
2 q_list = ['112354401', '116115526', '114909312', '122425491', '131957025', '111373473']
----> 3 df1 = pd.DataFrame.from_records(q_list, columns=['q_data'])
4 df1
/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
1021 else:
1022 arrays, arr_columns = _to_arrays(data, columns,
-> 1023 coerce_float=coerce_float)
1024
1025 arr_columns = _ensure_index(arr_columns)
/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py in _to_arrays(data, columns, coerce_float, dtype)
5550 data = lmap(tuple, data)
5551 return _list_to_arrays(data, columns, coerce_float=coerce_float,
-> 5552 dtype=dtype)
5553
5554
/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py in _list_to_arrays(data, columns, coerce_float, dtype)
5607 content = list(lib.to_object_array(data).T)
5608 return _convert_object_array(content, columns, dtype=dtype,
-> 5609 coerce_float=coerce_float)
5610
5611
/usr/local/lib/python3.4/dist-packages/pandas/core/frame.py in _convert_object_array(content, columns, coerce_float, dtype)
5666 # caller's responsibility to check for this...
5667 raise AssertionError('%d columns passed, passed data had %s '
-> 5668 'columns' % (len(columns), len(content)))
5669
5670 # provide soft conversion of object dtypes
AssertionError: 1 columns passed, passed data had 9 columns
Why would the same approach work for one list but not another? Any idea what might be wrong here? Thanks a lot!
DataFrame.from_records treats a string as a sequence of characters, so it needs as many columns as the string has characters; each of your id strings is 9 characters long, hence the "1 columns passed, passed data had 9 columns" error.
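If you want to stick with from_records, one minimal sketch is to wrap each string in a one-element tuple so it is treated as a single-field record:
import pandas as pd

q_list = ['112354401', '116115526', '114909312', '122425491', '131957025', '111373473']
# Each one-element tuple becomes one record with a single field.
df1 = pd.DataFrame.from_records([(q,) for q in q_list], columns=['q_data'])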
You could simply use the DataFrame constructor.
In [3]: pd.DataFrame(q_list, columns=['q_data'])
Out[3]:
q_data
0 112354401
1 116115526
2 114909312
3 122425491
4 131957025
5 111373473
In[20]: test_list = [['a','b','c'], ['AA','BB','CC']]
In[21]: pd.DataFrame(test_list, columns=['col_A', 'col_B', 'col_C'])
Out[21]:
col_A col_B col_C
0 a b c
1 AA BB CC
In[22]: pd.DataFrame(test_list, index=['col_low', 'col_up']).T
Out[22]:
col_low col_up
0 a AA
1 b BB
2 c CC
If you want to create a DataFrame from multiple lists, you can simply zip the lists together. zip() returns a 'zip' object, so convert it back to a list first.
mydf = pd.DataFrame(list(zip(lstA, lstB)), columns = ['My List A', 'My List B'])
You could also use the concat method, but it only accepts pandas objects, so wrap the list in a Series first:
test_list = ['a', 'b', 'c', 'd']
pd.concat([pd.Series(test_list, name='my_letters')], axis=1)
You could also use numpy:
import numpy as np
df1 = pd.DataFrame(np.array(q_list),columns=['q_data'])

Pandas Series Resampling: How do I get moves based on certain previous changes?

import pandas as pd
import numpy as np
import datetime as dt
# Create Column names
col_names = ['930', '931', '932', '933', '934', '935']
# Create Index datetimes
idx_names = pd.date_range(start = dt.datetime(2011, 1, 1), periods = 10, freq= 'D')
# Create dataframe with previously created column names and index datetimes
df1 = pd.DataFrame(np.random.randn(10, 6), columns=col_names, index=idx_names)
# Change the column names from strings to datetimes.time() object
df1.columns = [dt.datetime.strptime(x, '%H%M').time() for x in df1.columns]
# This step and the next step changes the dataframe into a chronological timeseries
df2 = df1.T.unstack()
df2.index = [dt.datetime.combine(x[0], x[1]) for x in df2.index.tolist()]
# Show the series
df2
Question: What is the most pythonic/pandas-thonic way to create a specific list? The list would answer: "Every time the difference between 9:32 and 9:34 is between 0 and .50, what is the difference between that day's 9:34 and the next day's 9:34?"
I was doing this with the numbers in a dataframe format (dates along the x-axis and times along the y-axis) and I would say something like (below is pseudo-code, above is not pseudo-code):
# Create a column with wrong answers and right answers
df['Today 934 minus yesterday 934'] = df[934] - df[934].shift(1)
# Boolean mask where condition 1 (diff > 0) and condition 2 (diff < .5) are true
mask = (df[934].shift(1) - df[932].shift(1) > 0) & (df[934].shift(1) - df[932].shift(1) < .5)
# Apply the boolean mask to the dataframe. This will remove all the answers
# I don't want from the df['Today 934 minus yesterday 934'] column
df2 = df[mask]
# Only the answers I want:
answers = df['Today 934 minus yesterday 934']
My attempt, basically a filled in version of your pseudo-code. Someone else may have a cleaner approach.
# 9:34 and 9:32 rows of the series
mask1 = (df2.index.hour == 9) & (df2.index.minute == 34)
mask2 = (df2.index.hour == 9) & (df2.index.minute == 32)
# Difference between consecutive 9:34 observations, kept on the 9:34 timestamps
diff_934 = df2[mask1] - df2[mask1].shift(-1)
diff_934 = diff_934[diff_934.index.minute == 34]
# Difference between consecutive 9:32/9:34 observations, kept on the 9:34 timestamps
diff_932 = df2[mask1 | mask2] - df2[mask1 | mask2].shift(-1)
diff_932 = diff_932[diff_932.index.minute == 34]
# Keep only the timestamps where that difference is between 0 and .5
diff_932 = diff_932[(diff_932 > 0) & (diff_932 < .5)]
# Restrict the 9:34 differences to those timestamps
answer = diff_934.reindex(diff_932.index)
In [116]: answer
Out[116]:
2011-01-02 09:34:00 -0.874153
2011-01-08 09:34:00 0.186254
dtype: float64