The current PySpark DataFrame has this structure (a list of WrappedArrays for col2):
+---+-------------------------------------------------+
|id |col2                                             |
+---+-------------------------------------------------+
|a  |[WrappedArray(code2), WrappedArray(code1, code3)]|
|b  |[WrappedArray(code5), WrappedArray(code6, code8)]|
+---+-------------------------------------------------+
This is the structure I would like to have (a flattened list for col2):
+---+---------------------+
|id |col2                 |
+---+---------------------+
|a  |[code2, code1, code3]|
|b  |[code5, code6, code8]|
+---+---------------------+
but I'm not sure how to do that transformation. I tried a flatMap, but that didn't seem to work. Any suggestions?
You can do this in two ways, with a UDF or with the RDD API. Here is an example:
df = sqlContext.createDataFrame([
['a', [['code2'],['code1', 'code3']]],
['b', [['code5','code6'], ['code8']]]
], ["id", "col2"])
df.show(truncate = False)
+---+-------------------------------------------------+
|id |col2 |
+---+-------------------------------------------------+
|a |[WrappedArray(code2), WrappedArray(code1, code3)]|
|b |[WrappedArray(code5, code6), WrappedArray(code8)]|
+---+-------------------------------------------------+
RDD:
from functools import reduce

df.rdd.map(lambda row: (row[0], reduce(lambda x, y: x + y, row[1]))).toDF().show(truncate=False)
+---+---------------------+
|_1 |_2 |
+---+---------------------+
|a |[code2, code1, code3]|
|b |[code5, code6, code8]|
+---+---------------------+
UDF:
from functools import reduce
from pyspark.sql import functions as F
import pyspark.sql.types as T

def fudf(val):
    # concatenate the nested lists into one flat list
    # (equivalent loop: emlist = []; for item in val: emlist += item; return emlist)
    return reduce(lambda x, y: x + y, val)

flattenUdf = F.udf(fudf, T.ArrayType(T.StringType()))
df.select("id", flattenUdf("col2").alias("col2")).show(truncate=False)
+---+---------------------+
|id |col2 |
+---+---------------------+
|a |[code2, code1, code3]|
|b |[code5, code6, code8]|
+---+---------------------+
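If you are on Spark 2.4 or later, you can also skip the UDF entirely and use the built-in flatten function (a minimal sketch on the same df):
from pyspark.sql import functions as F
# flatten concatenates an array of arrays into a single array (available since Spark 2.4)
df.select("id", F.flatten("col2").alias("col2")).show(truncate=False)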
I am trying to concatenate multiple columns into just one column, but only if the column name is in a list.
So issue = {'a','b','c'} is my list, and I would need to concatenate those columns into an issue column with ; as the separator.
I have tried:
1.
df_issue = df.withColumn('issue', concat_ws(';',map_values(custom.({issue}))))
Which returns an invalid syntax error.
2.
df_issue = df.withColumn('issue', lit(issue))
This just returned a b c and not their values.
Thank you
You can simply use concat_ws:
from pyspark.sql import functions as F
columns_to_concat = ['a', 'b', 'c']
df.withColumn('issue', F.concat_ws(';', *columns_to_concat))
So, if your input DataFrame is:
+---+---+---+----------+----------+-----+
| a| b| c| date1| date2|value|
+---+---+---+----------+----------+-----+
| k1| k2| k3|2022-11-11|2022-11-14| 5|
| k4| k5| k6|2022-11-15|2022-11-19| 5|
| k7| k8| k9|2022-11-15|2022-11-19| 5|
+---+---+---+----------+----------+-----+
The previous code will produce:
+---+---+---+----------+----------+-----+--------+
| a| b| c| date1| date2|value| issue|
+---+---+---+----------+----------+-----+--------+
| k1| k2| k3|2022-11-11|2022-11-14| 5|k1;k2;k3|
| k4| k5| k6|2022-11-15|2022-11-19| 5|k4;k5;k6|
| k7| k8| k9|2022-11-15|2022-11-19| 5|k7;k8;k9|
+---+---+---+----------+----------+-----+--------+
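Note that concat_ws skips null values, so a column with a missing value simply drops out of the joined string; no extra null handling is needed. A quick check with made-up rows (assuming an active spark session and the columns_to_concat list from above):
from pyspark.sql import functions as F
# hypothetical rows just to illustrate the null behaviour
demo = spark.createDataFrame([('k1', None, 'k3'), ('k4', 'k5', 'k6')], ['a', 'b', 'c'])
demo.withColumn('issue', F.concat_ws(';', *columns_to_concat)).show()
# the issue column comes out as 'k1;k3' and 'k4;k5;k6' - the null in b is skipped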
I have a situation where I need to compare multiple pairs of columns (the number of pairs will vary and can come from a list, as shown in the code snippet below) and get a 1/0 flag for match/mismatch respectively. Eventually I want to use this to identify the number of records/rows with a mismatch and the percentage of records mismatched.
NONKEYCOLS= ['Marks', 'Qualification']
The first image is the source df and the second image is the expected df.
Since this happens for multiple pairs in a loop, it is very slow for about a billion records. I need help with something more efficient.
I have the code below, but the part that calculates changed records is taking a long time.
for ind, cols in enumerate(NONKEYCOLS):
    print(ind)
    print(cols)
    globals()['new_dataset' + '_char_changes_tmp'] = globals()['new_dataset' + '_char_changes_tmp']\
        .withColumn("records_changed" + str(ind),
                    F.sum(col("records_ch_flag_" + str(ind))).over(w1))
    globals()['new_dataset' + '_char_changes_tmp'] = globals()['new_dataset' + '_char_changes_tmp']\
        .withColumn("records_changed_cnt" + str(ind),
                    F.count(col("records_ch_flag_" + str(ind))).over(w1))
I'm not sure what loop you are running, but here's an implementation with a list comprehension within a select.
from pyspark.sql import functions as func

data_ls = [
(10, 11, 'foo', 'foo'),
(12, 12, 'bar', 'bar'),
(10, 12, 'foo', 'bar')
]
data_sdf = spark.sparkContext.parallelize(data_ls). \
toDF(['marks_1', 'marks_2', 'qualification_1', 'qualification_2'])
col_pairs = ['marks','qualification']
data_sdf. \
select('*',
*[(func.col(c+'_1') == func.col(c+'_2')).cast('int').alias(c+'_check') for c in col_pairs]
). \
show()
# +-------+-------+---------------+---------------+-----------+-------------------+
# |marks_1|marks_2|qualification_1|qualification_2|marks_check|qualification_check|
# +-------+-------+---------------+---------------+-----------+-------------------+
# | 10| 11| foo| foo| 0| 1|
# | 12| 12| bar| bar| 1| 1|
# | 10| 12| foo| bar| 0| 0|
# +-------+-------+---------------+---------------+-----------+-------------------+
where the list comprehension would yield the following
[(func.col(c+'_1') == func.col(c+'_2')).cast('int').alias(c+'_check') for c in col_pairs]
# [Column<'CAST((marks_1 = marks_2) AS INT) AS `marks_check`'>,
# Column<'CAST((qualification_1 = qualification_2) AS INT) AS `qualification_check`'>]
EDIT
Based on the additional (updated) info, you need the count of unmatched records for each pair, and then you want to calculate the unmatched percentage.
Reversing the aforementioned logic to count the unmatched records:
col_pairs = ['marks','qualification']
data_sdf. \
agg(*[func.sum((func.col(c+'_1') != func.col(c+'_2')).cast('int')).alias(c+'_unmatch') for c in col_pairs],
func.count('*').alias('row_cnt')
). \
select('*',
*[(func.col(c+'_unmatch') / func.col('row_cnt')).alias(c+'_unmatch_perc') for c in col_pairs]
). \
show()
# +-------------+---------------------+-------+------------------+--------------------------+
# |marks_unmatch|qualification_unmatch|row_cnt|marks_unmatch_perc|qualification_unmatch_perc|
# +-------------+---------------------+-------+------------------+--------------------------+
# | 2| 1| 3|0.6666666666666666| 0.3333333333333333|
# +-------------+---------------------+-------+------------------+--------------------------+
The code flags (as 1) the records where the pair does not match and takes a sum of the flag, which gives us the pair's unmatched record count. Dividing that by the total row count gives the percentage.
The list comprehension will yield the following:
[func.sum((func.col(c+'_1') != func.col(c+'_2')).cast('int')).alias(c+'_unmatch') for c in col_pairs]
# [Column<'sum(CAST((NOT (marks_1 = marks_2)) AS INT)) AS `marks_unmatch`'>,
# Column<'sum(CAST((NOT (qualification_1 = qualification_2)) AS INT)) AS `qualification_unmatch`'>]
This is efficient because all of it happens in a single select statement, which projects only once in the Spark plan, as opposed to your approach, which adds a projection every time you call withColumn and is therefore inefficient for Spark.
df.colRegex may serve you well. If all the values in the columns that match the regex are equal, you get 1. The script is efficient, as everything is done in one select.
Inputs:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('p', 1, 2, 'g', 'm'),
('a', 3, 3, 'g', 'g'),
('b', 4, 5, 'g', 'g'),
('r', 8, 8, 'm', 'm'),
('d', 2, 1, 'u', 'g')],
['Name', 'Marks_1', 'Marks_2', 'Qualification_1', 'Qualification_2'])
col_pairs = ['Marks', 'Qualification']
Script:
def equals(*cols):
return (F.size(F.array_distinct(F.array(*cols))) == 1).cast('int')
df = df.select(
'*',
*[equals(df.colRegex(f"`^{c}.*`")).alias(f'{c}_result') for c in col_pairs]
)
df.show()
# +----+-------+-------+---------------+---------------+------------+--------------------+
# |Name|Marks_1|Marks_2|Qualification_1|Qualification_2|Marks_result|Qualification_result|
# +----+-------+-------+---------------+---------------+------------+--------------------+
# | p| 1| 2| g| m| 0| 0|
# | a| 3| 3| g| g| 1| 1|
# | b| 4| 5| g| g| 0| 1|
# | r| 8| 8| m| m| 1| 1|
# | d| 2| 1| u| g| 0| 0|
# +----+-------+-------+---------------+---------------+------------+--------------------+
Proof of efficiency:
df.explain()
# == Physical Plan ==
# *(1) Project [Name#636, Marks_1#637L, Marks_2#638L, Qualification_1#639, Qualification_2#640, cast((size(array_distinct(array(Marks_1#637L, Marks_2#638L)), true) = 1) as int) AS Marks_result#646, cast((size(array_distinct(array(Qualification_1#639, Qualification_2#640)), true) = 1) as int) AS Qualification_result#647]
# +- Scan ExistingRDD[Name#636,Marks_1#637L,Marks_2#638L,Qualification_1#639,Qualification_2#640]
Edit:
def equals(*cols):
return (F.size(F.array_distinct(F.array(*cols))) != 1).cast('int')
df = df.select(
'*',
*[equals(df.colRegex(f"`^{c}.*`")).alias(f'{c}_result') for c in col_pairs]
).agg(
*[F.sum(f'{c}_result').alias(f'rec_changed_{c}') for c in col_pairs],
*[(F.sum(f'{c}_result') / F.count(f'{c}_result')).alias(f'{c}_%_rec_changed') for c in col_pairs]
)
df.show()
# +-----------------+-------------------------+-------------------+---------------------------+
# |rec_changed_Marks|rec_changed_Qualification|Marks_%_rec_changed|Qualification_%_rec_changed|
# +-----------------+-------------------------+-------------------+---------------------------+
# | 3| 2| 0.6| 0.4|
# +-----------------+-------------------------+-------------------+---------------------------+
I need to validate dates (in string format) in a PySpark DataFrame, and I need to remove additional characters or notations from the date if they are present. How can I do that kind of validation?
I came across this code:
regex_string='\/](19|[2-9][0-9])\d\d$)|(^29[\/]02[\/](19|[2-9][0-9])(00|04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96)$)'
df.select(regexp_extract(col("date"),regex_string,0).alias("cleaned_map"),col('date')).show()
Below is my output
+-----------+-----------+
|cleaned_map| date|
+-----------+-----------+
| |01/06/w2020|
| |02/06/2!020|
| 02/06/2020| 02/06/2020|
| 03/06/2020| 03/06/2020|
| 04/06/2020| 04/06/2020|
| 05/06/2020| 05/06/2020|
| 02/06/2020| 02/06/2020|
+-----------+-----------+
My expected output
+-----------+-----------+
|cleaned_map| date|
+-----------+-----------+
| 01/06/2020|01/06/w2020|
| 02/06/2020|02/06/20!20|
| 03/06/2020| 03/06/2020|
| 04/06/2020| 04/06/2020|
| 05/06/2020| 05/06/2020|
| 06/06/2020| 06/06/2020|
| 07/06/2020| 07/06/2020|
+-----------+-----------+
Try this (in Scala):
val df = Seq("01/06/w2020",
"02/06/2!020",
"02/06/2020",
"03/06/2020",
"04/06/2020",
"05/06/2020",
"02/06/2020",
"//01/0/4/202/0").toDF("date")
df.withColumn("cleaned_map", regexp_replace($"date", "[^0-9T]", ""))
.withColumn("date_type", to_date($"cleaned_map", "ddMMyyyy"))
.show(false)
/**
* +--------------+-----------+----------+
* |date |cleaned_map|date_type |
* +--------------+-----------+----------+
* |01/06/w2020 |01062020 |2020-06-01|
* |02/06/2!020 |02062020 |2020-06-02|
* |02/06/2020 |02062020 |2020-06-02|
* |03/06/2020 |03062020 |2020-06-03|
* |04/06/2020 |04062020 |2020-06-04|
* |05/06/2020 |05062020 |2020-06-05|
* |02/06/2020 |02062020 |2020-06-02|
* |//01/0/4/202/0|01042020 |2020-04-01|
* +--------------+-----------+----------+
*/
Extend the pattern "[^0-9T]" (for example to "[^0-9/T]" to keep the slashes) if you want to exclude more characters from being removed.
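If you need the same logic in PySpark, a rough equivalent (assuming a DataFrame df with the date column from the question) would be:
from pyspark.sql import functions as F
(df
 .withColumn("cleaned_map", F.regexp_replace("date", "[^0-9T]", ""))  # strip everything except digits (and T)
 .withColumn("date_type", F.to_date("cleaned_map", "ddMMyyyy"))       # parse the remaining digits as ddMMyyyy
 .show(truncate=False))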
Try regexp_replace to remove additional character notations.
df.show()
# +-----------+
# | date|
# +-----------+
# |01/06/w2020|
# |02/06/2!020|
# | 02/06/2020|
# +-----------+
df.withColumn("cleaned_map", F.regexp_replace("date", r'[^\d\/]','')).show()
# +-----------+-----------+
# | date|cleaned_map|
# +-----------+-----------+
# |01/06/w2020| 01/06/2020|
# |02/06/2!020| 02/06/2020|
# | 02/06/2020| 02/06/2020|
# +-----------+-----------+
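If you also want an explicit validity flag after cleaning, you can try to parse the cleaned value with to_date; with the default (non-ANSI) settings, values that still do not form a valid dd/MM/yyyy date come back as null. A sketch:
df.withColumn("cleaned_map", F.regexp_replace("date", r'[^\d\/]', '')) \
  .withColumn("is_valid", F.to_date("cleaned_map", "dd/MM/yyyy").isNotNull()) \
  .show()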
I am trying to work with PySpark DataFrames and I would like to know how I can create and populate a new column using existing columns.
Let's say I have a DataFrame that looks like this:
+-----+---+---+
| _1| _2| _3|
+-----+---+---+
|x1-y1| 3| z1|
|x2-y2| 2| z2|
|x3-y3| 1| z3|
+-----+---+---+
I am looking for a way to create a DataFrame that looks like this:
+-----+---+---+----+--------+
| _1| _2| _3| _4| _5|
+-----+---+---+----+--------+
|x1-y1| 3| z1|x1y1|x1=y1=z1|
|x2-y2| 2| z2|x2y2|x2=y2=z2|
|x3-y3| 1| z3|x3y3|x3=y3=z3|
+-----+---+---+----+--------+
_4 is just _1 with the '-' removed, and _5 uses values from _1 and _3.
I am using spark-2.3.3 and python 2.7
Thanks!
You can use pyspark.sql.functions to achieve it.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
sqlContext = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
data = [('x1-y1', 3,'z1'),
('x2-y2', 2,'z2'),
('x3-y3', 1,'z3')]
test_df = sqlContext.createDataFrame(data, schema=['_1', '_2', '_3'])
test_df = test_df.withColumn('_4', F.regexp_replace('_1', '-', ''))
test_df = test_df.withColumn('_5', F.concat(F.regexp_replace('_1', '-', '='),F.lit('='),F.col('_3')))
test_df.show()
+-----+---+---+----+--------+
| _1| _2| _3| _4| _5|
+-----+---+---+----+--------+
|x1-y1| 3| z1|x1y1|x1=y1=z1|
|x2-y2|  2| z2|x2y2|x2=y2=z2|
|x3-y3|  1| z3|x3y3|x3=y3=z3|
+-----+---+---+----+--------+
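As an alternative for _5, if you would rather not nest regexp_replace inside concat, you can build it from the split pieces with concat_ws (a small sketch on the same test_df):
# split '_1' on '-' and rejoin the pieces together with '_3', using '=' as the separator
parts = F.split('_1', '-')
test_df.withColumn('_5', F.concat_ws('=', parts.getItem(0), parts.getItem(1), F.col('_3'))).show()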
Sometimes in academic texts one wants to present a table in which every column has units. It is usual for the units to be specified below the column names, like this:
|Object |Volume | area | Price |
| |$cm^3$ |$cm^2$ | euros |
|:------------|:-------|--------:|---------:|
|A |3 | 43.36| 567.40|
|B |15 | 43.47| 1000.80|
|C |1 | 42.18| 8.81|
|D |7 | 37.92| 4.72|
How could I achieve this for my bookdown documents?
Thank you in advance.
Here is a way using kableExtra:
```{r}
library(kableExtra)
df <- data.frame(Object = LETTERS[1:5],
Volume = round(runif(5, 1, 20)),
area = rnorm(5, 40, 3),
Price = rnorm(5, 700, 200))
colNames <- names(df)
dfUnits <- c("", "$cm^3$", "$cm^2$", "€")
kable(df, col.names = dfUnits,escape = F, align = "c") %>%
add_header_above(header = colNames, line = F, align = "c")
```