The current PySpark DataFrame has this structure (a list of WrappedArrays for col2):
+---+-------------------------------------------------+
|id |col2                                             |
+---+-------------------------------------------------+
|a  |[WrappedArray(code2), WrappedArray(code1, code3)]|
|b  |[WrappedArray(code5), WrappedArray(code6, code8)]|
+---+-------------------------------------------------+
This is the structure I would like to have (a flattened list for col2):
+---+---------------------+
|id |col2                 |
+---+---------------------+
|a  |[code2, code1, code3]|
|b  |[code5, code6, code8]|
+---+---------------------+
but I'm not sure how to do that transformation. I tried a flatMap, but that didn't seem to work. Any suggestions?
You can do this in two ways, with a UDF or with the RDD API. Here is an example:
df = sqlContext.createDataFrame([
['a', [['code2'],['code1', 'code3']]],
['b', [['code5','code6'], ['code8']]]
], ["id", "col2"])
df.show(truncate = False)
+---+-------------------------------------------------+
|id |col2 |
+---+-------------------------------------------------+
|a |[WrappedArray(code2), WrappedArray(code1, code3)]|
|b |[WrappedArray(code5, code6), WrappedArray(code8)]|
+---+-------------------------------------------------+
RDD:
from functools import reduce

df.rdd.map(lambda row: (row[0], reduce(lambda x, y: x + y, row[1]))).toDF().show(truncate=False)
+---+---------------------+
|_1 |_2 |
+---+---------------------+
|a |[code2, code1, code3]|
|b |[code5, code6, code8]|
+---+---------------------+
UDF:
from functools import reduce
from pyspark.sql import functions as F
import pyspark.sql.types as T

def fudf(val):
    # concatenate the nested lists into one flat list
    # (equivalent loop: emlist = []; for item in val: emlist += item; return emlist)
    return reduce(lambda x, y: x + y, val)

flattenUdf = F.udf(fudf, T.ArrayType(T.StringType()))
df.select("id", flattenUdf("col2").alias("col2")).show(truncate=False)
+---+---------------------+
|id |col2 |
+---+---------------------+
|a |[code2, code1, code3]|
|b |[code5, code6, code8]|
+---+---------------------+
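If you are on Spark 2.4 or later, you can also skip the UDF entirely and use the built-in flatten function (a minimal sketch on the same df):
from pyspark.sql import functions as F
# flatten concatenates an array of arrays into a single array (available since Spark 2.4)
df.select("id", F.flatten("col2").alias("col2")).show(truncate=False)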
I am trying to concatenate multiple columns into just one column, but only if the column name is in a list.
So issue = {'a','b','c'} is my list, and I would need to concatenate those columns into an issue column with ; as the separator.
I have tried:
1.
df_issue = df.withColumn('issue', concat_ws(';',map_values(custom.({issue}))))
Which returns an invalid syntax error.
2.
df_issue = df.withColumn('issue', lit(issue))
This just returned a b c and not their values.
Thank you
You can simply use concat_ws:
from pyspark.sql import functions as F
columns_to_concat = ['a', 'b', 'c']
df.withColumn('issue', F.concat_ws(';', *columns_to_concat))
So, if your input DataFrame is:
+---+---+---+----------+----------+-----+
| a| b| c| date1| date2|value|
+---+---+---+----------+----------+-----+
| k1| k2| k3|2022-11-11|2022-11-14| 5|
| k4| k5| k6|2022-11-15|2022-11-19| 5|
| k7| k8| k9|2022-11-15|2022-11-19| 5|
+---+---+---+----------+----------+-----+
The previous code will produce:
+---+---+---+----------+----------+-----+--------+
| a| b| c| date1| date2|value| issue|
+---+---+---+----------+----------+-----+--------+
| k1| k2| k3|2022-11-11|2022-11-14| 5|k1;k2;k3|
| k4| k5| k6|2022-11-15|2022-11-19| 5|k4;k5;k6|
| k7| k8| k9|2022-11-15|2022-11-19| 5|k7;k8;k9|
+---+---+---+----------+----------+-----+--------+
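Note that concat_ws skips null values, so a column with a missing value simply drops out of the joined string; no extra null handling is needed. A quick check with made-up rows (assuming an active spark session and the columns_to_concat list from above):
from pyspark.sql import functions as F
# hypothetical rows just to illustrate the null behaviour
demo = spark.createDataFrame([('k1', None, 'k3'), ('k4', 'k5', 'k6')], ['a', 'b', 'c'])
demo.withColumn('issue', F.concat_ws(';', *columns_to_concat)).show()
# the issue column comes out as 'k1;k3' and 'k4;k5;k6' - the null in b is skipped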
I have a situation where I need to compare multiple pairs of columns (the number of pairs will vary and can come from a list, as shown in the code snippet below) and get a 1/0 flag for match/mismatch respectively. Eventually I want to use this to identify the number of records/rows with a mismatch and the percentage of records mismatched.
NONKEYCOLS= ['Marks', 'Qualification']
The first image is the source df and the second image is the expected df.
Since this happens for multiple pairs in a loop, it is very slow for about a billion records. I need help with something more efficient.
I have the code below, but the part that calculates changed records is taking a long time.
for ind, cols in enumerate(NONKEYCOLS):
    print(ind)
    print(cols)
    globals()['new_dataset' + '_char_changes_tmp'] = globals()['new_dataset' + '_char_changes_tmp']\
        .withColumn("records_changed" + str(ind),
                    F.sum(col("records_ch_flag_" + str(ind))).over(w1))
    globals()['new_dataset' + '_char_changes_tmp'] = globals()['new_dataset' + '_char_changes_tmp']\
        .withColumn("records_changed_cnt" + str(ind),
                    F.count(col("records_ch_flag_" + str(ind))).over(w1))
I'm not sure what loop you are running, but here's an implementation with a list comprehension within a select.
from pyspark.sql import functions as func

data_ls = [
(10, 11, 'foo', 'foo'),
(12, 12, 'bar', 'bar'),
(10, 12, 'foo', 'bar')
]
data_sdf = spark.sparkContext.parallelize(data_ls). \
toDF(['marks_1', 'marks_2', 'qualification_1', 'qualification_2'])
col_pairs = ['marks','qualification']
data_sdf. \
select('*',
*[(func.col(c+'_1') == func.col(c+'_2')).cast('int').alias(c+'_check') for c in col_pairs]
). \
show()
# +-------+-------+---------------+---------------+-----------+-------------------+
# |marks_1|marks_2|qualification_1|qualification_2|marks_check|qualification_check|
# +-------+-------+---------------+---------------+-----------+-------------------+
# | 10| 11| foo| foo| 0| 1|
# | 12| 12| bar| bar| 1| 1|
# | 10| 12| foo| bar| 0| 0|
# +-------+-------+---------------+---------------+-----------+-------------------+
where the list comprehension would yield the following
[(func.col(c+'_1') == func.col(c+'_2')).cast('int').alias(c+'_check') for c in col_pairs]
# [Column<'CAST((marks_1 = marks_2) AS INT) AS `marks_check`'>,
# Column<'CAST((qualification_1 = qualification_2) AS INT) AS `qualification_check`'>]
EDIT
Based on the additional (updated) info, you need the count of unmatched records for each pair, and then you want to calculate the unmatched percentage.
Reversing the aforementioned logic to count the unmatched records:
col_pairs = ['marks','qualification']
data_sdf. \
agg(*[func.sum((func.col(c+'_1') != func.col(c+'_2')).cast('int')).alias(c+'_unmatch') for c in col_pairs],
func.count('*').alias('row_cnt')
). \
select('*',
*[(func.col(c+'_unmatch') / func.col('row_cnt')).alias(c+'_unmatch_perc') for c in col_pairs]
). \
show()
# +-------------+---------------------+-------+------------------+--------------------------+
# |marks_unmatch|qualification_unmatch|row_cnt|marks_unmatch_perc|qualification_unmatch_perc|
# +-------------+---------------------+-------+------------------+--------------------------+
# | 2| 1| 3|0.6666666666666666| 0.3333333333333333|
# +-------------+---------------------+-------+------------------+--------------------------+
The code flags (as 1) the records where the pair does not match and takes a sum of the flag, which gives us the pair's unmatched record count. Dividing that by the total row count gives the percentage.
The list comprehension will yield the following:
[func.sum((func.col(c+'_1') != func.col(c+'_2')).cast('int')).alias(c+'_unmatch') for c in col_pairs]
# [Column<'sum(CAST((NOT (marks_1 = marks_2)) AS INT)) AS `marks_unmatch`'>,
# Column<'sum(CAST((NOT (qualification_1 = qualification_2)) AS INT)) AS `qualification_unmatch`'>]
This is efficient because all of it happens in a single select statement, which projects only once in the Spark plan, as opposed to your approach, which adds a projection every time you call withColumn and is therefore inefficient for Spark.
df.colRegex may serve you well. If all the values in the columns that match the regex are equal, you get 1. The script is efficient, as everything is done in one select.
Inputs:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('p', 1, 2, 'g', 'm'),
('a', 3, 3, 'g', 'g'),
('b', 4, 5, 'g', 'g'),
('r', 8, 8, 'm', 'm'),
('d', 2, 1, 'u', 'g')],
['Name', 'Marks_1', 'Marks_2', 'Qualification_1', 'Qualification_2'])
col_pairs = ['Marks', 'Qualification']
Script:
def equals(*cols):
return (F.size(F.array_distinct(F.array(*cols))) == 1).cast('int')
df = df.select(
'*',
*[equals(df.colRegex(f"`^{c}.*`")).alias(f'{c}_result') for c in col_pairs]
)
df.show()
# +----+-------+-------+---------------+---------------+------------+--------------------+
# |Name|Marks_1|Marks_2|Qualification_1|Qualification_2|Marks_result|Qualification_result|
# +----+-------+-------+---------------+---------------+------------+--------------------+
# | p| 1| 2| g| m| 0| 0|
# | a| 3| 3| g| g| 1| 1|
# | b| 4| 5| g| g| 0| 1|
# | r| 8| 8| m| m| 1| 1|
# | d| 2| 1| u| g| 0| 0|
# +----+-------+-------+---------------+---------------+------------+--------------------+
Proof of efficiency:
df.explain()
# == Physical Plan ==
# *(1) Project [Name#636, Marks_1#637L, Marks_2#638L, Qualification_1#639, Qualification_2#640, cast((size(array_distinct(array(Marks_1#637L, Marks_2#638L)), true) = 1) as int) AS Marks_result#646, cast((size(array_distinct(array(Qualification_1#639, Qualification_2#640)), true) = 1) as int) AS Qualification_result#647]
# +- Scan ExistingRDD[Name#636,Marks_1#637L,Marks_2#638L,Qualification_1#639,Qualification_2#640]
Edit:
def equals(*cols):
return (F.size(F.array_distinct(F.array(*cols))) != 1).cast('int')
df = df.select(
'*',
*[equals(df.colRegex(f"`^{c}.*`")).alias(f'{c}_result') for c in col_pairs]
).agg(
*[F.sum(f'{c}_result').alias(f'rec_changed_{c}') for c in col_pairs],
*[(F.sum(f'{c}_result') / F.count(f'{c}_result')).alias(f'{c}_%_rec_changed') for c in col_pairs]
)
df.show()
# +-----------------+-------------------------+-------------------+---------------------------+
# |rec_changed_Marks|rec_changed_Qualification|Marks_%_rec_changed|Qualification_%_rec_changed|
# +-----------------+-------------------------+-------------------+---------------------------+
# | 3| 2| 0.6| 0.4|
# +-----------------+-------------------------+-------------------+---------------------------+
I need to validate dates (in string format) in a PySpark DataFrame, and I need to remove additional characters or notations from the date if they are present. How can I do that kind of validation?
I came across this code:
regex_string='\/](19|[2-9][0-9])\d\d$)|(^29[\/]02[\/](19|[2-9][0-9])(00|04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96)$)'
df.select(regexp_extract(col("date"),regex_string,0).alias("cleaned_map"),col('date')).show()
Below is my output
+-----------+-----------+
|cleaned_map| date|
+-----------+-----------+
| |01/06/w2020|
| |02/06/2!020|
| 02/06/2020| 02/06/2020|
| 03/06/2020| 03/06/2020|
| 04/06/2020| 04/06/2020|
| 05/06/2020| 05/06/2020|
| 02/06/2020| 02/06/2020|
+-----------+-----------+
My expected output
+-----------+-----------+
|cleaned_map| date|
+-----------+-----------+
| 01/06/2020|01/06/w2020|
| 02/06/2020|02/06/20!20|
| 03/06/2020| 03/06/2020|
| 04/06/2020| 04/06/2020|
| 05/06/2020| 05/06/2020|
| 06/06/2020| 06/06/2020|
| 07/06/2020| 07/06/2020|
+-----------+-----------+
Try this (in Scala):
val df = Seq("01/06/w2020",
"02/06/2!020",
"02/06/2020",
"03/06/2020",
"04/06/2020",
"05/06/2020",
"02/06/2020",
"//01/0/4/202/0").toDF("date")
df.withColumn("cleaned_map", regexp_replace($"date", "[^0-9T]", ""))
.withColumn("date_type", to_date($"cleaned_map", "ddMMyyyy"))
.show(false)
/**
* +--------------+-----------+----------+
* |date |cleaned_map|date_type |
* +--------------+-----------+----------+
* |01/06/w2020 |01062020 |2020-06-01|
* |02/06/2!020 |02062020 |2020-06-02|
* |02/06/2020 |02062020 |2020-06-02|
* |03/06/2020 |03062020 |2020-06-03|
* |04/06/2020 |04062020 |2020-06-04|
* |05/06/2020 |05062020 |2020-06-05|
* |02/06/2020 |02062020 |2020-06-02|
* |//01/0/4/202/0|01042020 |2020-04-01|
* +--------------+-----------+----------+
*/
Extend the pattern "[^0-9T]" (for example to "[^0-9/T]" to keep the slashes) if you want to exclude more characters from being removed.
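If you need the same logic in PySpark, a rough equivalent (assuming a DataFrame df with the date column from the question) would be:
from pyspark.sql import functions as F
(df
 .withColumn("cleaned_map", F.regexp_replace("date", "[^0-9T]", ""))  # strip everything except digits (and T)
 .withColumn("date_type", F.to_date("cleaned_map", "ddMMyyyy"))       # parse the remaining digits as ddMMyyyy
 .show(truncate=False))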
Try regexp_replace to remove additional character notations.
df.show()
# +-----------+
# | date|
# +-----------+
# |01/06/w2020|
# |02/06/2!020|
# | 02/06/2020|
# +-----------+
df.withColumn("cleaned_map", F.regexp_replace("date", r'[^\d\/]','')).show()
# +-----------+-----------+
# | date|cleaned_map|
# +-----------+-----------+
# |01/06/w2020| 01/06/2020|
# |02/06/2!020| 02/06/2020|
# | 02/06/2020| 02/06/2020|
# +-----------+-----------+
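If you also want an explicit validity flag after cleaning, you can try to parse the cleaned value with to_date; with the default (non-ANSI) settings, values that still do not form a valid dd/MM/yyyy date come back as null. A sketch:
df.withColumn("cleaned_map", F.regexp_replace("date", r'[^\d\/]', '')) \
  .withColumn("is_valid", F.to_date("cleaned_map", "dd/MM/yyyy").isNotNull()) \
  .show()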
I am trying to work with PySpark DataFrames and I would like to know how I can create and populate a new column using existing columns.
Let's say I have a DataFrame that looks like this:
+-----+---+---+
| _1| _2| _3|
+-----+---+---+
|x1-y1| 3| z1|
|x2-y2| 2| z2|
|x3-y3| 1| z3|
+-----+---+---+
I am looking for a way to create a DataFrame that looks like this:
+-----+---+---+----+--------+
| _1| _2| _3| _4| _5|
+-----+---+---+----+--------+
|x1-y1| 3| z1|x1y1|x1=y1=z1|
|x2-y2| 2| z2|x2y2|x2=y2=z2|
|x3-y3| 1| z3|x3y3|x3=y3=z3|
+-----+---+---+----+--------+
_4 is just _1 with the '-' removed, and _5 uses values from _1 and _3.
I am using spark-2.3.3 and python 2.7
Thanks!
You can use pyspark.sql.functions to achieve it.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
sqlContext = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
data = [('x1-y1', 3,'z1'),
('x2-y2', 2,'z2'),
('x3-y3', 1,'z3')]
test_df = sqlContext.createDataFrame(data, schema=['_1', '_2', '_3'])
test_df = test_df.withColumn('_4', F.regexp_replace('_1', '-', ''))
test_df = test_df.withColumn('_5', F.concat(F.regexp_replace('_1', '-', '='),F.lit('='),F.col('_3')))
test_df.show()
+-----+---+---+----+--------+
| _1| _2| _3| _4| _5|
+-----+---+---+----+--------+
|x1-y1| 3| z1|x1y1|x1=y1=z1|
|x2-y2|  2| z2|x2y2|x2=y2=z2|
|x3-y3|  1| z3|x3y3|x3=y3=z3|
+-----+---+---+----+--------+
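As an alternative for _5, if you would rather not nest regexp_replace inside concat, you can build it from the split pieces with concat_ws (a small sketch on the same test_df):
# split '_1' on '-' and rejoin the pieces together with '_3', using '=' as the separator
parts = F.split('_1', '-')
test_df.withColumn('_5', F.concat_ws('=', parts.getItem(0), parts.getItem(1), F.col('_3'))).show()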
Sometimes in academic texts one wants to present a table in which every column has units. It is usual for the units to be specified below the column names, like this:
|Object |Volume | area | Price |
| |$cm^3$ |$cm^2$ | euros |
|:------------|:-------|--------:|---------:|
|A |3 | 43.36| 567.40|
|B |15 | 43.47| 1000.80|
|C |1 | 42.18| 8.81|
|D |7 | 37.92| 4.72|
How could I achieve this for my bookdown documents?
Thank you in advance.
Here is a way using kableExtra:
```{r}
library(kableExtra)
df <- data.frame(Object = LETTERS[1:5],
Volume = round(runif(5, 1, 20)),
area = rnorm(5, 40, 3),
Price = rnorm(5, 700, 200))
colNames <- names(df)
dfUnits <- c("", "$cm^3$", "$cm^2$", "€")
kable(df, col.names = dfUnits,escape = F, align = "c") %>%
add_header_above(header = colNames, line = F, align = "c")
```