pyspark column transformation - list

I have two predefined lists as below.
East = ["Bengal", "Bihar", "Assam"]
West = ["Bombay", "Gujarat", "Goa"]
I have a PySpark dataframe as below. I need to add a third column (State) to the dataframe, based on which of the lists the value in the second column (City) appears in.
df:
Num City
1 Bengal
2 Goa
3 Bombay
4 Bihar
Expected output:
Num City State
1 Bengal East
2 Goa West
3 Bombay West
4 Bihar East
Thanks

You can use the isin function.
East = ["Bengal", "Bihar", "Assam"]
West = ["Bombay", "Gujarat", "Goa"]
from pyspark.sql.functions import when, col
df.withColumn("state", when(col("City").isin(East), "East")
              .when(col("City").isin(West), "West")
              .otherwise(None)).show()
+---+------+-----+
|Num| City|state|
+---+------+-----+
| 1|Bengal| East|
| 2| Goa| West|
| 3|Bombay| West|
| 4| Bihar| East|
+---+------+-----+

I could only do this in pandas, as below. Since the dataset is huge, I am trying to convert this into PySpark. Thanks.
Pandas code below:
def map_state(name):
    # print(name)
    East = ["Bengal", "Bihar", "Assam"]
    West = ["Bombay", "Gujarat", "Goa"]
    if name in East:
        return 'East'
    if name in West:
        return 'West'
    else:
        return name

df['State'] = df['City'].apply(map_state)
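For completeness, a hedged sketch of a direct PySpark translation of the same map_state logic with a plain Python UDF (the isin answer above is usually preferable, since a Python UDF adds serialization overhead):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

East = ["Bengal", "Bihar", "Assam"]
West = ["Bombay", "Gujarat", "Goa"]

def map_state(name):
    # same logic as the pandas version above
    if name in East:
        return 'East'
    if name in West:
        return 'West'
    return name

map_state_udf = udf(map_state, StringType())
df = df.withColumn("State", map_state_udf("City"))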

Related

Pyspark Mean value of each element in multiple lists

I have a df with 2 columns:
id
vector
This is a sample of how it looks:
+--------------------+----------+
| vector| id|
+--------------------+----------+
|[8.32,3.22,5.34,6.5]|1046091128|
|[8.52,3.34,5.31,6.3]|1046091128|
|[8.44,3.62,5.54,6.4]|1046091128|
|[8.31,3.12,5.21,6.1]|1046091128|
+--------------------+----------+
I want to groupBy id and take the mean of each element of the vectors. So, for example, the first value in the aggregated list will be (8.32+8.52+8.44+8.31)/4, and so on.
Any help is appreciated.
This assumes that you know the length of the array column:
from pyspark.sql import functions as F

l = 4  # size of array column
df1 = df.select("id", *[F.col("vector")[i] for i in range(l)])
out = df1.groupby("id").agg(F.array([F.mean(i) for i in df1.columns[1:]]).alias("vector"))
out.show(truncate=False)
out.show(truncate=False)
+----------+----------------------------------------+
|id |vector |
+----------+----------------------------------------+
|1046091128|[8.3975, 3.325, 5.35, 6.325000000000001]|
+----------+----------------------------------------+
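If the length is not known in advance, one way to get it (a sketch, assuming every row's vector has the same length) is to read the size of the first row, with F imported as above:

l = df.select(F.size("vector").alias("n")).first()["n"]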
You can use the posexplode function and then aggregate the exploded column with an average. Something like below:
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [([8.32, 3.22, 5.34, 6.5], 1046091128),
        ([8.52, 3.34, 5.31, 6.3], 1046091128),
        ([8.44, 3.62, 5.54, 6.4], 1046091128),
        ([8.31, 3.12, 5.21, 6.1], 1046091128)]
schema = StructType([StructField("vector", ArrayType(FloatType())),
                     StructField("id", IntegerType())])
df = spark.createDataFrame(data=data, schema=schema)

df.select("id", posexplode("vector")).groupBy("id").pivot("pos").agg(avg("col")).show()
Output would look somewhat like:
+----------+-----------------+------------------+-----------------+-----------------+
| id| 0| 1| 2| 3|
+----------+-----------------+------------------+-----------------+-----------------+
|1046091128|8.397500038146973|3.3249999284744263|5.350000023841858|6.325000047683716|
+----------+-----------------+------------------+-----------------+-----------------+
You can rename the columns later if required.
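For example, a hedged sketch of collapsing the pivoted per-position columns (named 0..3 after the pos values) back into a single array column:

from pyspark.sql.functions import array, avg, col, posexplode

pivoted = df.select("id", posexplode("vector")).groupBy("id").pivot("pos").agg(avg("col"))
result = pivoted.select("id", array(*[col(str(i)) for i in range(4)]).alias("vector"))
result.show(truncate=False)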
You could also avoid the pivot by grouping by id and pos, and then grouping by id alone with collect_list:
df.select("id", posexplode("vector")).groupby('id','pos').agg(avg('col').alias('vector')).groupby('id').agg(collect_list('vector').alias('vector')).show(truncate=False)
Outcome
+----------+-----------------------------------------------------------------------------+
|id |vector |
+----------+-----------------------------------------------------------------------------+
|1046091128|[8.397500038146973, 5.350000023841858, 3.3249999284744263, 6.325000047683716]|
+----------+-----------------------------------------------------------------------------+
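Note that collect_list does not guarantee element order after the second groupBy, which is why the outcome above is not in pos order. A hedged sketch of one way to keep positions ordered, by collecting (pos, mean) structs and sorting them:

from pyspark.sql.functions import avg, col, collect_list, posexplode, sort_array, struct

ordered = (df.select("id", posexplode("vector"))
             .groupby("id", "pos").agg(avg("col").alias("mean"))
             .groupby("id")
             .agg(sort_array(collect_list(struct("pos", "mean"))).alias("tmp"))
             .select("id", col("tmp.mean").alias("vector")))
ordered.show(truncate=False)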

PySpark using Regexp_extract and Col to Create Dataset

I need help creating a dataset that shows both the first name and last name of people who live in Texas and the area code of their phone numbers (phone1). This is the code that I tried to use, and below is the dataset that I was given.
from pyspark.sql.functions import regexp_extract, col
regexp_extract(col('first_name + last_name'), '.by\s+(\w+)', 1))
first_name last_name company_name address city county state zip phone1
Billy Thornton Qdoba 8142 Yougla Road Dallas Fort Worth TX 34218 689-956-0765
Joe Swanson Beachfront 9243 Trace Street Miami Dade FL 56432 890-780-9674
Kevin Knox MSG 7683 Brooklyn Ave New York New York NY 56987 850-342-1123
Bill Lamb AFT 6394 W Beast Dr Houston Galveston TX 32804 407-413-4842
Raylene Kampa Hermar Inc 2046 SW Nylin Rd Elkhart Elkhart IN 46514 574-499-1454
Now I see. Your phone number format splits cleanly on the hyphen, so use split.
df.show()
+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+
|first_name|last_name|company_name| address| city| county|state| zip| phone1|
+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+
| Billy| Thornton| Qdoba| 8142 Yougla Road| Dallas|Fort Worth| TX|34218|689-956-0765|
| Joe| Swanson| Beachfront|9243 Trace Street| Miami| Dade| FL|56432|890-780-9674|
| Kevin| Knox| MSG|7683 Brooklyn Ave|New York| New York| NY|56987|850-342-1123|
| Bill| Lamb| AFT| 6394 W Beast Dr| Houston| Galveston| TX|32804|407-413-4842|
| Raylene| Kampa| Hermar Inc| 2046 SW Nylin Rd| Elkhart| Elkhart| IN|46514|574-499-1454|
+----------+---------+------------+-----------------+--------+----------+-----+-----+------------+
from pyspark.sql.functions import split

df.filter("state = 'TX'") \
  .withColumn('area_code', split('phone1', "-")[0]) \
  .select('first_name', 'last_name', 'state', 'area_code') \
  .show()
+----------+---------+-----+---------+
|first_name|last_name|state|area_code|
+----------+---------+-----+---------+
| Billy| Thornton| TX| 689|
| Bill| Lamb| TX| 407|
+----------+---------+-----+---------+
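Since the question mentions regexp_extract, here is a hedged equivalent that extracts the leading three digits of phone1 instead of splitting:

from pyspark.sql.functions import col, regexp_extract

df.filter(col('state') == 'TX') \
  .withColumn('area_code', regexp_extract(col('phone1'), r'^(\d{3})', 1)) \
  .select('first_name', 'last_name', 'state', 'area_code') \
  .show()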

Pandas Dataframe Wildcard Values in List

How can I filter a dataframe to rows whose values are contained within a list? Specifically, the values in the dataframe will only be partial matches with the list entries, never exact matches.
I've tried using pandas.DataFrame.isin, but this only works if the values in the dataframe are exactly the same as in the list.
list = ["123 MAIN STREET", "456 BLUE ROAD", "789 SKY DRIVE"]
df =
address
0 123 MAIN
1 456 BLUE
2 987 PANDA
target_df = df[df["address"].isin(list)]
Ideally the result should be
target_df =
address
0 123 MAIN
1 456 BLUE
Use str.contains and a simple regex using | to connect the terms.
f = '|'.join
mask = f(map(f, map(str.split, list)))
df[df.address.str.contains(mask)]
address
0 123 MAIN
1 456 BLUE
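For clarity, the mask built above splits each list entry into words and joins everything with |, so for this list it evaluates to the pattern below; any address containing any one of those words will match:

print(mask)
# 123|MAIN|STREET|456|BLUE|ROAD|789|SKY|DRIVE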
Ended up using a loop (list comprehension) instead:
df[[any(x in y for y in list) for x in df.address]]
Out[257]:
address
0 123 MAIN
1 456 BLUE

How to delete words from a dataframe column that are present in dictionary in Pandas

An extension to:
Removing list of words from a string
I have the following dataframe and I want to delete frequently occurring words from the df.name column:
df :
name
Bill Hayden
Rock Clinton
Bill Gates
Vishal James
James Cameroon
Micky James
Michael Clark
Tony Waugh
Tom Clark
Tom Bill
Avinash Clinton
Shreyas Clinton
Ramesh Clinton
Adam Clark
I'm creating a new dataframe with the words and their frequencies with the following code:
df = pd.DataFrame(data.name.str.split(expand=True).stack().value_counts())
df.reset_index(level=0, inplace=True)
df.columns = ['word', 'freq']
df = df[df['freq'] >= 3]
which will result in the following word-frequency dataframe:
word freq
Clinton 4
Bill 3
James 3
Clark 3
Then I'm converting it into a dictionary with the following code snippet:
d = dict(zip(df['word'], df['freq']))
Now, to remove words that are in d (a dictionary of word : freq) from the name column of the original dataframe (data in the code below), I'm using the following code snippet:
def check_thresh_word(merc, d):
    m = merc.split(' ')
    for i in range(len(m)):
        if m[i] in d.keys():
            return False
        else:
            return True

def rm_freq_occurences(merc, d):
    if check_thresh_word(merc, d) == False:
        nwords = merc.split(' ')
        rwords = [word for word in nwords if word not in d.keys()]
        m = ' '.join(rwords)
    else:
        m = merc
    return m

data['new_name'] = data['name'].apply(lambda x: rm_freq_occurences(x, d))
But my actual dataframe contains nearly 240k rows, and I have to use a threshold (freq >= 3 in the above sample) greater than 100.
So the above code takes a lot of time to run because of the repeated searches.
Is there any efficient way to make it faster?
Following is a desired output :
name
Hayden
Rock
Gates
Vishal
Cameroon
Micky
Michael
Tony Waugh
Tom
Tom
Avinash
Shreyas
Ramesh
Adam
Thanks in advance!!!!!!!
Use replace with a regex created by joining all the values of the word column, then strip the leftover whitespace:
data.name = data.name.replace('|'.join(df['word']), '', regex=True).str.strip()
Another solution is to add \s* to match zero or more whitespace characters:
pat = '|'.join(['\s*{}\s*'.format(x) for x in df['word']])
print (pat)
\s*Clinton\s*|\s*James\s*|\s*Bill\s*|\s*Clark\s*
data.name = data.name.replace(pat, '', regex=True)
print (data)
name
0 Hayden
1 Rock
2 Gates
3 Vishal
4 Cameroon
5 Micky
6 Michael
7 Tony Waugh
8 Tom
9 Tom
10 Avinash
11 Shreyas
12 Ramesh
13 Adam
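One caveat (an assumption about the wider data, not visible in this sample): a plain alternation such as Bill would also strip 'Bill' out of longer words like 'Billy'. A hedged variant with word boundaries avoids that:

import re

# \b keeps matches to whole words; \s* still swallows the surrounding spaces
pat = '|'.join([r'\s*\b{}\b\s*'.format(re.escape(x)) for x in df['word']])
data.name = data.name.replace(pat, '', regex=True).str.strip()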

Finding duplicate values based on condition

Below is the sample data:
1 ,ASIF JAVED IQBAL JAVED,JAVED IQBAL SO INAYATHULLAH,20170103
2 ,SYED MUSTZAR ALI MUHAMMAD ILYAS SHAH,MUHAMMAD SAFEER SO SAGHEER KHAN,20170127
3 ,AHSUN SABIR SABIR ALI,MISBAH NAVEED DO NAVEED ANJUM,20170116
4 ,RASHAD IQBAL PARVAIZ IQBAL,PERVAIZ IQBAL SO GUL HUSSAIN KHAN,20170104
5 ,RASHID ALI MUGHERI ABDUL RASOOL MUGHERI,MUMTAZ ALI BOHIO,20170105
6 ,FAKHAR IMAM AHMAD ALI,MOHAMMAD AKHLAQ ASHIQ HUSSAIN,20170105
7 ,AQEEL SARWAR MUHAMMAD SARWAR BUTT,BUSHRA WAHID,20170106
8 ,SHAFAQAT ALI REHMAT ALI,SAJIDA BIBI WO MUHAMMAD ASHRAF,20170106
9 ,MUHAMMAD ISMAIL SHAFQAT HUSSAIN,USAMA IQBAL,20170103
10 ,SULEMAN ALI KACHI GHULAM ALI,MUHAMMAD SHARIF ALLAH WARAYO,20170109
1st is serial #, 2nd is sender, 3rd is receiver, 4th is date
and this data goes on for like million rows.
Now, I want to find cases where the same sender sends a parcel to the same receiver on the same date.
I wrote the following basic code for this, but it's very slow.
import csv
from fuzzywuzzy import fuzz

serial = []
agency = []
rem_name = []
rem_name2 = []
date = []

with open('janCSV.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        serial.append(row[0])
        rem_name.append(row[2])
        rem_name2.append(row[2])
        date.append(row[4])

with open('output.csv', 'w') as out:
    for rem1 in rem_name:
        date1 = date[rem_name.index(rem1)]
        serial1 = serial[rem_name.index(rem1)]
        for rem2 in rem_name2:
            date2 = date[rem_name2.index(rem2)]
            if date1 == date2:
                ratio = fuzz.ratio(rem1, rem2)
                if ratio >= 90 and ratio < 100:
                    print serial1, rem1, rem2, date1, date2, ratio
                    out.write(str(serial1) + ',' + str(date1) + ',' + str(date2) + ',' +
                              str(rem1) + ',' + str(rem2) + ',' + str(ratio) + '\n')
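A hedged sketch of one way to speed this up, under the assumption that the columns are laid out as in the sample (serial, sender, receiver, date): group the rows by date first, so fuzzy comparisons only happen between rows that share a date instead of across every pair of rows. This compares sender names within each date group; the comparison can be extended to receivers the same way if needed.

import csv
from collections import defaultdict
from fuzzywuzzy import fuzz

rows_by_date = defaultdict(list)
with open('janCSV.csv') as f:
    for row in csv.reader(f):
        # assumed layout from the sample: serial, sender, receiver, date
        rows_by_date[row[3].strip()].append(row)

with open('output.csv', 'w') as out:
    for day, rows in rows_by_date.items():
        for i in range(len(rows)):
            for j in range(i + 1, len(rows)):
                ratio = fuzz.ratio(rows[i][1], rows[j][1])
                if 90 <= ratio < 100:
                    out.write(','.join([rows[i][0], day, rows[i][1], rows[j][1], str(ratio)]) + '\n')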