Spark regex while joining data frames - regex

I need to write a regex for a condition check in Spark while doing a join.
My regex should match the strings below:
n3_testindia1 = test-india-1
n2_stagamerica2 = stag-america-2
n1_prodeurope2 = prod-europe-2
df1.select("location1").distinct.show()
+---------------+
|      location1|
+---------------+
|  n3_testindia1|
|n2_stagamerica2|
| n1_prodeurope2|
+---------------+
df2.select("loc1").distinct.show()
+--------------+
| loc1 |
+--------------+
|test-india-1 |
|stag-america-2|
|prod-europe-2 |
+--------------+
I want to join based on the location columns, like below:
val joindf = df1.join(df2, df1("location1") == regex(df2("loc1")))

Based on the information above, you can do that in Spark 2.4.0 using
import org.apache.spark.sql.functions.{regexp_extract, translate, lit, length}
val joindf = df1.join(df2,
  regexp_extract(df1("location1"), """[^_]+_(.*)""", 1)
    === translate(df2("loc1"), "-", ""))
Or in prior versions something like
val joindf = df1.join(df2,
  df1("location1").substr(lit(4), length(df1("location1")))
    === translate(df2("loc1"), "-", ""))

You can split location1 on "_" and take the second element, then match it against loc1 with the "-" characters removed. Check this out:
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val df1 = Seq(("n3_testindia1"),("n2_stagamerica2"),("n1_prodeurope2")).toDF("location1")
df1: org.apache.spark.sql.DataFrame = [location1: string]
scala> val df2 = Seq(("test-india-1"),("stag-america-2"),("prod-europe-2")).toDF("loc1")
df2: org.apache.spark.sql.DataFrame = [loc1: string]
scala> df1.join(df2,split('location1,"_")(1) === regexp_replace('loc1,"-",""),"inner").show
+---------------+--------------+
| location1| loc1|
+---------------+--------------+
| n3_testindia1| test-india-1|
|n2_stagamerica2|stag-america-2|
| n1_prodeurope2| prod-europe-2|
+---------------+--------------+
scala>

Related

Scala spark how to interact with a List[Option[Map[String, DataFrame]]]

I'm trying to interact with this List[Option[Map[String, DataFrame]]] but I'm having a bit of trouble.
Inside it has something like this:
customer1 -> dataframeX
customer2 -> dataframeY
customer3 -> dataframeZ
Where the customer is an identifier that will become a new column.
I need to do a union of dataframeX, dataframeY and dataframeZ (all dfs have the same columns). Before I had this:
map(_.get).reduce(_ union _).select(columns:_*)
And it was working fine because I only had a List[Option[DataFrame]] and didn't need the identifier, but I'm having trouble with the new list. My idea is to modify my old mapping. I know I can do things like "(0).get", which would give me "Map(customer1 -> dataframeX)", but I'm not quite sure how to do that iteration in the mapping and get the final dataframe that is the union of all three plus the identifier. My idea:
map(/*get identifier here along with dataframe*/).reduce(_ union _).select(identifier +: columns:_*)
The final result would be something like:
-------------------------------
|identifier | product |State |
-------------------------------
| customer1| prod1 | VA |
| customer1| prod132 | VA |
| customer2| prod32 | CA |
| customer2| prod51 | CA |
| customer2| prod21 | AL |
| customer2| prod52 | AL |
-------------------------------
You could use collect to unnest Option[Map[String, DataFrame]] to Map[String, DataFrame]. To put the identifier into a column you should use withColumn. So your code could look like:
import org.apache.spark.sql.functions.lit
val result: DataFrame = frames.collect {
  case Some(m) =>
    m.map {
      case (identifier, dataframe) => dataframe.withColumn("identifier", lit(identifier))
    }.reduce(_ union _)
}.reduce(_ union _)
Something like this perhaps?
import org.apache.spark.sql.functions.lit
list
  .flatten
  .flatMap {
    _.map { case (id, df) =>
      df.withColumn("identifier", lit(id)) }
  }.reduce(_ union _)
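If you then want the identifier column first, as in the select(identifier +: columns:_*) idea from the question, a minimal sketch on top of the unioned DataFrame (the result value from the first snippet) could look like this; the column names are taken from the expected output and are otherwise assumptions:
import org.apache.spark.sql.functions.col
// hypothetical column names, matching the expected output above
val columns = Seq("product", "State")
val finalDf = result.select(("identifier" +: columns).map(col): _*)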

Replacing regex pattern with another string works, but replacing with NONE replaces all values

I am trying to replace all strings in a column that start with 'DEL_' with a NULL value.
I have tried this:
customer_details = customer_details.withColumn("phone_number", F.regexp_replace("phone_number", "DEL_.*", ""))
Which works as expected and the new column now looks like this:
+--------------+
| phone_number|
+--------------+
|00971585059437|
|00971559274811|
|00971559274811|
| |
|00918472847271|
| |
+--------------+
However, if I change the code to:
customer_details = customer_details.withColumn("phone_number", F.regexp_replace("phone_number", "DEL_.*", None))
This now replaces all values in the column:
+------------+
|phone_number|
+------------+
| null|
| null|
| null|
| null|
| null|
| null|
+------------+
Try this-
scala
import org.apache.spark.sql.functions.{when, col}
df.withColumn("phone_number", when(col("phone_number").rlike("^DEL_.*"), null)
  .otherwise(col("phone_number"))
)
python
from pyspark.sql.functions import when, col
df.withColumn("phone_number", when(col("phone_number").rlike("^DEL_.*"), None)
  .otherwise(col("phone_number"))
)
Update
Query-
Can you explain why my original solution doesn't work? customer_details.withColumn("phone_number", F.regexp_replace("phone_number", "DEL_.*", None))
Ans- All the ternary expressions (functions taking 3 arguments) are null-safe. That means that if Spark finds any of the arguments null, it will return null without any actual processing (e.g. no pattern matching for regexp_replace).
You may want to look at this piece of the Spark repo:
override def eval(input: InternalRow): Any = {
  val exprs = children
  val value1 = exprs(0).eval(input)
  if (value1 != null) {
    val value2 = exprs(1).eval(input)
    if (value2 != null) {
      val value3 = exprs(2).eval(input)
      if (value3 != null) {
        return nullSafeEval(value1, value2, value3)
      }
    }
  }
  null
}
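As a rough way to reproduce the same effect from the Scala API (a sketch, assuming the customer_details DataFrame with a string phone_number column): because the replacement literal evaluates to null, the null-safe eval above returns null for every row, whether or not the pattern matches.
import org.apache.spark.sql.functions.{col, regexp_replace}
// mirrors passing None from PySpark: a null replacement nulls every row
val nullReplacement: String = null
val allNulls = customer_details.withColumn(
  "phone_number",
  regexp_replace(col("phone_number"), "DEL_.*", nullReplacement))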

How to create a business logic for regular expression and save data to csv file

We have a .txt log file; I used Scala Spark to read the file. The file contains sets of data row-wise. I read the data one by one as below:
val sc = spark.sparkContext
val dataframe = sc.textFile("/path/to/log/*.txt")
val get_set_element = sc.textFile("filepath.txt")
val pattern = """(\S+) "([\S\s]+)\" (\S+) (\S+) (\S+) (\S+)""".r
val test = get_set_element.map { line =>
  (for {
    m <- pattern.findAllIn(line).matchData
    g <- m.subgroups
  } yield g).toList
}.map(l => (l(0), l(1), l(2), l(3), l(4), l(5)))
I want to create a DataFrame so that I can save it into a csv file.
It can be created from an RDD[Row], with a schema assigned:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// instead of: map(l => (l(0), l(1), l(2), l(3), l(4), l(5)))
.map(Row.fromSeq)
val fields = (0 to 5).map(idx => StructField(name = "l" + idx, dataType = StringType, nullable = true))
val df = spark.createDataFrame(test, StructType(fields))
Output:
+---+---+---+---+---+---+
|l0 |l1 |l2 |l3 |l4 |l5 |
+---+---+---+---+---+---+
|a |b |c |d |e |f |
+---+---+---+---+---+---+
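To save it into a csv file, as the question asks, something along these lines should work (the output path is just a placeholder):
df.write
  .option("header", "true")
  .mode("overwrite")
  .csv("/path/to/output")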

Using Pyspark, the regex between two characters gets text and numbers, but not date

Using Pyspark regexp_extract() I can extract the substring between two characters in the string. It is grabbing the text and numbers, but it is not grabbing the dates.
data = [('2345', '<Date>1999/12/12 10:00:05</Date>'),
        ('2398', '<Crew>crewIdXYZ</Crew>'),
        ('2328', '<Latitude>0.8252644369443788</Latitude>'),
        ('3983', '<Longitude>-2.1915840465066916<Longitude>')]
df = sc.parallelize(data).toDF(['ID', 'values'])
df.show(truncate=False)
+----+-----------------------------------------+
|ID |values |
+----+-----------------------------------------+
|2345|<Date>1999/12/12 10:00:05</Date> |
|2398|<Crew>crewIdXYZ</Crew> |
|2328|<Latitude>0.8252644369443788</Latitude> |
|3983|<Longitude>-2.1915840465066916<Longitude>|
+----+-----------------------------------------+
df_2 = df.withColumn('vals', regexp_extract(col('values'), '(.)((?<=>)[^<:]+(?=:?<))', 2))
df_2.show(truncate=False)
+----+-----------------------------------------+-------------------+
|ID |values |vals |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date> | |
|2398|<Crew>crewIdXYZ</Crew> |crewIdXYZ |
|2328|<Latitude>0.8252644369443788</Latitude> |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+
What can I add to the regex statement to get the date as well?
#jxc Thanks. Here is what made it work:
df_2 = df.withColumn('vals', regexp_extract(col('values'), '(.)((?<=>)[^>]+(?=:?<))', 2))
df_2.show(truncate=False)
+----+-----------------------------------------+-------------------+
|ID |values |vals |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date> |1999/12/12 10:00:05|
|2398|<Crew>crewIdXYZ</Crew> |crewIdXYZ |
|2328|<Latitude>0.8252644369443788</Latitude> |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+
You may use
>([^<>]+)<
The regex matches a >, then captures into Group 1 any one or more chars other than < and >, and then just matches <. The third argument (the group index) should be set to 1 since the value you need is in Group 1:
df_2 = df.withColumn('vals', regexp_extract(col('values'), '>([^<>]+)<', 1))
df_2.show(truncate=False)
+----+-----------------------------------------+-------------------+
|ID |values |vals |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date> |1999/12/12 10:00:05|
|2398|<Crew>crewIdXYZ</Crew> |crewIdXYZ |
|2328|<Latitude>0.8252644369443788</Latitude> |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+

How to get rows where a field contains ( ) , [ ] % or + using the rlike SparkSQL function

Let's say you have a Spark dataframe with multiple columns and you want to return the rows where the columns contain specific characters. Specifically, you want to return the rows where at least one of the fields contains ( ) , [ ] % or +.
What is the proper syntax if you want to use the Spark SQL rlike function?
import spark.implicits._
val dummyDf = Seq(("John[", "Ha", "Smith?"),
  ("Julie", "Hu", "Burol"),
  ("Ka%rl", "G", "Hu!"),
  ("(Harold)", "Ju", "Di+")
).toDF("FirstName", "MiddleName", "LastName")
dummyDf.show()
+---------+----------+--------+
|FirstName|MiddleName|LastName|
+---------+----------+--------+
| John[| Ha| Smith?|
| Julie| Hu| Burol|
| Ka%rl| G| Hu!|
| (Harold)| Ju| Di+|
+---------+----------+--------+
Expected Output
+---------+----------+--------+
|FirstName|MiddleName|LastName|
+---------+----------+--------+
| John[| Ha| Smith?|
| Ka%rl| G| Hu!|
| (Harold)| Ju| Di+|
+---------+----------+--------+
My few attempts return errors or not what I expected, even when I try to do it just for searching (.
I know that I could use the simple like construct multiple times, but I am trying to figure out how to do it in a more concise way with regex and Spark SQL.
You can try this using the rlike method:
dummyDf.show()
+---------+----------+--------+
|FirstName|MiddleName|LastName|
+---------+----------+--------+
| John[| Ha| Smith?|
| Julie| Hu| Burol|
| Ka%rl| G| Hu!|
| (Harold)| Ju| Di+|
| +Tim| Dgfg| Ergf+|
+---------+----------+--------+
import org.apache.spark.sql.functions.{col, lit}
val df = dummyDf.withColumn("hasSpecial", lit(false))
val result = df.dtypes
  .collect { case (dn, dt) => dn }
  .foldLeft(df)((accDF, c) => accDF.withColumn("hasSpecial", col(c).rlike(".*[\\(\\)\\[\\]%+]+.*") || col("hasSpecial")))
result.filter(col("hasSpecial")).show(false)
Output:
+---------+----------+--------+----------+
|FirstName|MiddleName|LastName|hasSpecial|
+---------+----------+--------+----------+
|John[ |Ha |Smith? |true |
|Ka%rl |G |Hu! |true |
|(Harold) |Ju |Di+ |true |
|+Tim |Dgfg |Ergf+ |true |
+---------+----------+--------+----------+
You can also drop the hasSpecial column if you want.
Try this: .*[()\[\]%\+,.]+.*
.* any character, zero or more times
[()\[\]%\+,.]+ any of the characters inside the brackets, one or more times
.* any character, zero or more times
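As a rough sketch of that pattern in use with rlike (reusing dummyDf and the column names from the example above; the backslashes are doubled for the Scala string literal):
import org.apache.spark.sql.functions.col
// the regex suggested above, escaped for a Scala string
val pattern = ".*[()\\[\\]%\\+,.]+.*"
val special = dummyDf.filter(
  col("FirstName").rlike(pattern) ||
    col("MiddleName").rlike(pattern) ||
    col("LastName").rlike(pattern))
special.show(false)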