Using regexp to join two dataframes in Spark

Say I have a dataframe df1 with a column "color" that contains a bunch of colors, and another dataframe df2 with a column "phrase" that contains various phrases.
I'd like to join the two dataframes wherever the color in df1 appears in the phrase in df2. I cannot use df1.join(df2, df2("phrase").contains(df1("color"))), since that would join anywhere the word appears within the phrase. I don't want to match on words like scaRED, for example, where RED is part of another word. I only want to join when the color appears as a separate word in the phrase.
Can I use a regular expression to solve this? Which function can I use, and what is the syntax when I need to reference the column in the expression?

You could create a regex pattern that checks for word boundaries (\b) when matching colors, and use a regexp_replace check as the join condition:
import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq(
  (1, "red"), (2, "green"), (3, "blue")
).toDF("id", "color")

val df2 = Seq(
  "red apple", "scared cat", "blue sky", "green hornet"
).toDF("phrase")

// Pattern built per row: \b<color>\b
val patternCol = concat(lit("\\b"), df1("color"), lit("\\b"))

df1.join(df2, regexp_replace(df2("phrase"), patternCol, lit("")) =!= df2("phrase")).show()
// +---+-----+------------+
// | id|color|      phrase|
// +---+-----+------------+
// |  1|  red|   red apple|
// |  3| blue|    blue sky|
// |  2|green|green hornet|
// +---+-----+------------+
Note that "scared cat" would have been a match in the absence of the enclosed word boundaries.

Building on your own solution, you can also try this:
d1.join(d2, array_contains(split(d2("phrases"), " "), d1("color")))
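For illustration, here is the same idea applied to the sample data from the previous answer (where the column is named "phrase"). It matches on whole space-separated tokens, so note that a color followed by punctuation, e.g. "red.", would not match:
df1.join(df2, array_contains(split(df2("phrase"), " "), df1("color"))).show()
// +---+-----+------------+
// | id|color|      phrase|
// +---+-----+------------+
// |  1|  red|   red apple|
// |  3| blue|    blue sky|
// |  2|green|green hornet|
// +---+-----+------------+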

I did not see your data, but here is a starter, with a little variation. No need for regex as far as I can see, but who knows:
// You need to do some parsing, e.g. stripping of . and ?, and maybe lowercasing or uppercasing
// You did not provide an example of the JOIN
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
import spark.implicits._

// Case-insensitive check: does the array of words contain the value?
val checkValue = udf { (array: WrappedArray[String], value: String) =>
  array.iterator.map(_.toLowerCase).contains(value.toLowerCase)
}

// Generate some data
val dfCompare = spark.sparkContext.parallelize(Seq("red", "blue", "gold", "cherry")).toDF("color")
val rdd = sc.parallelize(Seq(
  ("red", "hello how are you red", 10),
  ("blue", "I am fine but blue", 20),
  ("cherry", "you need to do some parsing and I like cherry", 30),
  ("thebluephantom", "you need to do some parsing and I like fanta", 30)
))
val df = rdd.toDF()
val df2 = df.withColumn("_4", split($"_2", " "))

df2.show(false)
dfCompare.show(false)

val res = df2.join(dfCompare, checkValue(df2("_4"), dfCompare("color")), "inner")
res.show(false)
returns:
+------+---------------------------------------------+---+--------------------------------------------------------+------+
|_1    |_2                                           |_3 |_4                                                      |color |
+------+---------------------------------------------+---+--------------------------------------------------------+------+
|red   |hello how are you red                        |10 |[hello, how, are, you, red]                             |red   |
|blue  |I am fine but blue                           |20 |[I, am, fine, but, blue]                                |blue  |
|cherry|you need to do some parsing and I like cherry|30 |[you, need, to, do, some, parsing, and, I, like, cherry]|cherry|
+------+---------------------------------------------+---+--------------------------------------------------------+------+

Related

How do I replace a delimiter that appears only in between something?

I have a use case with this data:
1. "apple+case"
2. "apple+case+10+cover"
3. "apple+case+10++cover"
4. "+apple"
5. "iphone8+"
Currently, I am doing this to replace the + with space as follows:
def normalizer(value: String): String = {
  if (value == null) {
    null
  } else {
    value.replaceAll("\\+", BLANK_SPACE)  // BLANK_SPACE is a " " constant
  }
}

val testUDF = udf(normalizer(_: String): String)
df.withColumn("newCol", testUDF($"value"))
But this replaces all "+" characters. How do I replace a "+" that comes between strings, while also handling cases like "apple+case+10++cover" => "apple case 10+ cover"?
The output should be
1. "apple case"
2. "apple case 10 cover"
3. "apple case 10+ cover"
4. "apple"
5. "iphone8+"
You can use regexp_replace to do this instead of a udf; it should be much faster. For most of the cases you can use a negative lookahead in the regexp, but for "+apple" you actually want to replace the "+" with "" (not with a space). The easiest way is to simply use two regexps.
df.withColumn("newCol", regexp_replace($"value", "^\\+", ""))
.withColumn("newCol", regexp_replace($"newCol", "\\+(?!\\+|$)", " "))
This will give:
This will give:
+--------------------+--------------------+
|value               |newCol              |
+--------------------+--------------------+
|apple+case          |apple case          |
|apple+case+10+cover |apple case 10 cover |
|apple+case+10++cover|apple case 10+ cover|
|+apple              |apple               |
|iphone8+            |iphone8+            |
+--------------------+--------------------+
To make this more modular and reusable, you can define it as a function:
def normalizer(c: String) = regexp_replace(regexp_replace(col(c), "^\\+", ""), "\\+(?!\\+|$)", " ")
df.withColumn("newCol", normalizer("value"))
You may try making two regex replacements:
df.withColumn("newCol", regexp_replace(
    regexp_replace($"value", "(?<=\\d)\\+(?!\\+)", "+ "),
    "(?<!\\d)\\+", " ")).show
The inner regex replacement targets the edge case of a single plus preceded by a digit, which is handled by adding a space after it (without deleting the plus). Example:
apple+case+10+cover --> apple+case+10+ cover
The outer regex replacement then targets all pluses that are not preceded by a digit, and replaces them with a space. Continuing the example from above:
apple+case+10+ cover --> apple case 10+ cover

Extract words from a string column in spark dataframe

I have a column in a Spark dataframe which contains text.
I want to extract all the words which start with the special character '#', and I am using regexp_extract on each row of that text column. If the text contains multiple words starting with '#' it just returns the first one.
I am looking for a way to extract multiple words which match my pattern in Spark.
data_frame.withColumn("Names", regexp_extract($"text","(?<=^|(?<=[^a-zA-Z0-9-_\.]))#([A-Za-z]+[A-Za-z0-9_]+)",1).show
Sample input: #always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking
Sample output: #always_nidhi,#YouTube
You can create a udf function in Spark as below:
import java.util.regex.Pattern
import org.apache.spark.sql.functions.{col, lit, udf}

// Collects every match of the given capture group, comma-separated
def regexp_extractAll = udf((job: String, exp: String, groupIdx: Int) => {
  val pattern = Pattern.compile(exp)
  val m = pattern.matcher(job)
  var result = Seq[String]()
  while (m.find) {
    result = result :+ m.group(groupIdx)
  }
  result.mkString(",")
})
And then call the udf as below:
data_frame.withColumn("Names", regexp_extractAll(col("text"), lit("#\\w+"), lit(0))).show()
The above gives you output as below:
+--------------------+
|               Names|
+--------------------+
|#always_nidhi,#Yo...|
+--------------------+
I have used the regex as per the output you posted in the question. You can modify it to suit your needs.
You can use Java regex to extract those words. Below is the working code.
import org.apache.spark.{SparkConf, SparkContext}
import java.util.regex.Pattern

val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}

// User-defined function to extract all words starting with '#'
def toExtract(str: String) = {
  val pattern = Pattern.compile("#\\w+")
  val tmplst = scala.collection.mutable.ListBuffer.empty[String]
  val matcher = pattern.matcher(str)
  while (matcher.find()) {
    tmplst += matcher.group()
  }
  tmplst.mkString(",")
}

val Extract = udf(toExtract _)
val values = List("#always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()
Output
+--------------------+
|          UDF(words)|
+--------------------+
|#always_nidhi,#Yo...|
+--------------------+
In Spark 3.1+ it's possible using regexp_extract_all
Test with your input:
import spark.implicits._
var df = Seq(
  ("#always_nidhi #YouTube no"),
  ("#always_nidhi"),
  ("no")
).toDF("text")
val col_re_list = expr("regexp_extract_all(text, '(?<=^|(?<=[^a-zA-Z0-9-_\\\\.]))#([A-Za-z]+[A-Za-z0-9_]+)', 0)")
df.withColumn("Names", array_join(col_re_list, ", ")).show(false)
// +-------------------------+-----------------------+
// |text                     |Names                  |
// +-------------------------+-----------------------+
// |#always_nidhi #YouTube no|#always_nidhi, #YouTube|
// |#always_nidhi            |#always_nidhi          |
// |no                       |                       |
// +-------------------------+-----------------------+
array_join is used because you wanted the results in string format, while regexp_extract_all returns an array.
If you use \ for escaping in your pattern, you will need to write \\\\ instead of \\ until regexp_extract_all is available directly without expr, since the pattern string is parsed twice: once by Scala and once by Spark's SQL parser.
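As an aside (an assumption worth verifying against your version): newer Spark releases, 3.5+ if I recall correctly, expose regexp_extract_all directly in org.apache.spark.sql.functions, which avoids the double escaping because the pattern no longer goes through the SQL parser:
import org.apache.spark.sql.functions.{array_join, lit, regexp_extract_all}

// Single-level Scala escaping (\\) is enough here
val re = lit("(?<=^|(?<=[^a-zA-Z0-9-_\\.]))#([A-Za-z]+[A-Za-z0-9_]+)")
df.withColumn("Names", array_join(regexp_extract_all($"text", re, lit(0)), ", ")).show(false)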
I took the suggestion of Amit Kumar and created a UDF and then ran it in Spark SQL:
select Words(status) as people from dataframe
Words is my UDF and status is my dataframe column.
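For reference, the registration step implied here looks roughly like the following sketch (names are illustrative; it assumes the DataFrame has a string column status and was exposed as a temp view called dataframe):
// Register a Scala function as a SQL UDF so it can be called from SQL
spark.udf.register("Words", (status: String) => {
  val m = java.util.regex.Pattern.compile("#\\w+").matcher(status)
  val out = scala.collection.mutable.ListBuffer.empty[String]
  while (m.find()) out += m.group()
  out.mkString(",")
})
df.createOrReplaceTempView("dataframe")  // so "from dataframe" resolves
spark.sql("select Words(status) as people from dataframe").show()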

Difference between (^|\\s)([A-Z]{1,3})(\\s|$) and \\b[A-Z]{1,2}\\b regular expressions in R

I'm trying to clean some small strings (1-3 letters) stored in a column of an R data frame. Specifically, suppose the following R script:
df = data.frame("original" = c("ABCDE FG H",
                               "IJKL MN OPQRS",
                               "TUV WX YZ AAAA"))
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}($|\\s)", " ", df$original)
df$filter2 = gsub("\\b[A-Z]{1,2}\\b", " ", df$original)
> df
        original     filter1    filter2
1     ABCDE FG H     ABCDE H      ABCDE
2  IJKL MN OPQRS  IJKL OPQRS IJKL OPQRS
3 TUV WX YZ AAAA TUV YZ AAAA   TUV AAAA
I don't understand why the first filter, (^|\\s)[A-Z]{1,2}($|\\s), doesn't replace "H" in the first row or "YZ" in the third one. I would expect the same result as using \\b[A-Z]{1,2}\\b as the filter (the filter2 column). Please don't worry about the multiple spaces; they aren't important for me (unless they are the problem :)).
I thought that the problem was the "globality" of the operation, i.e. that after finding the first match it does not replace the second one, but that isn't true, as the following replacement shows:
> gsub("A", "X", "AAAABBBBCCCDDDDAAAAAAAEEE")
[1] "XXXXBBBBCCCDDDDXXXXXXXEEE"
So, why are the results different?
The point is that gsub can only match non-overlapping strings. " FG " being the first expected match and " H" the second, you can see that these strings overlap; thus, after "(^|\\s)[A-Z]{1,2}($|\\s)" consumes the trailing space after FG, H alone no longer matches the pattern.
Look: ABCDE FG H is analyzed from left to right. The expression matches " FG ", and the regex index is right before H. There is only this letter left to match, but (^|\s) requires a space or the start of the string, and there is neither at this location.
To "fix" this while keeping the same logic, you can use a PCRE regex in gsub with lookarounds:
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}(?=$|\\s)", " ", df$original, perl=TRUE)
or
df$filter1 = gsub("(?<!\\S)[A-Z]{1,2}(?!\\S)", " ", df$original, perl=TRUE)
and if you need to actually consume (i.e. remove) the spaces, just add \\s* before and/or after.
The second expression, "\\b[A-Z]{1,2}\\b", contains word boundaries, which are zero-width assertions that do not consume text; thus the regex engine can match both FG and H, since the spaces are not consumed.
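The same distinction is easy to reproduce outside R; for instance, a quick cross-check with Java/Scala regexes (an illustration only):
// The consuming alternation eats the space after FG, so H is missed;
// the zero-width \b leaves the spaces in place and matches both.
val s = "ABCDE FG H"
s.replaceAll("(^|\\s)[A-Z]{1,2}($|\\s)", " ")  // "ABCDE H"
s.replaceAll("\\b[A-Z]{1,2}\\b", " ")          // "ABCDE    " (FG and H both replaced)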

R: Can grep() include more than one pattern?

For instance, in this example, I would like to remove the elements in text that contain http and america.
> text <- c("One word#", "112a httpSentenceamerica", "you and meamerica", "three two one")
Hence, I would use the logical operator, |.
> pattern <- "http|america"
Which works because this is considered to be one pattern.
> grep(pattern, text, invert = TRUE, value = TRUE)
[1] "One word#" "three two one"
What if I have a long list of words that I would like to use in the pattern? How can I do it? I don't think I can keep chaining logical operators many times over.
Thank you in advance!
Generally, as @akrun said:
text <- c("One word#", "112a httpSentenceamerica", "you and meamerica", "three two one")
pattern = c("http", "america")
grep(paste(pattern, collapse = "|"), text, invert = TRUE, value = TRUE)
# [1] "One word#" "three two one"
You wrote that your list of words is "long." This solution doesn't scale indefinitely, unsurprisingly:
long_pattern = paste(rep(pattern, 1300), collapse = "|")
nchar(long_pattern)
# [1] 16899
grep(long_pattern, text, invert = TRUE, value = TRUE)
# Error in grep(long_pattern, text, invert = TRUE, value = TRUE) :
But if necessary, you could map-reduce over the individual patterns (rather than one giant pattern), starting with something along the lines of:
text[Reduce(`&`, Map(function(p) !grepl(p, text), pattern))]
# [1] "One word#" "three two one"

Looping through a list of regex matches and grabbing the first capture group in a loop

I am trying to loop through regex results, and insert the first capture group into a variable to be processed in a loop. But I can't figure out how to do so. Here's what I have so far, but it just prints the second match:
aQuote = "The big boat has a big assortment of big things."
theMatches = regmatches(aQuote, gregexpr("big ([a-z]+)", aQuote ,ignore.case = TRUE))
results = lapply(theMatches, function(m){
capturedItem = m[[2]]
print(capturedItem)
})
Right now it prints
[1] "big assortment"
What I want it to print is
[1] boat
[1] assortment
[1] things
Try this:
regmatches(aQuote, gregexpr("(?<=big )[a-z]+", aQuote, ignore.case = TRUE, perl = TRUE))[[1]]
#[1] "boat" "assortment" "things"
Include the g (global) modifier in your code as well.
The equivalent regex in Perl / JavaScript is: /big ([a-z]+)/ig
Sample Perl program:
$aQuote = "The big boat has a big assortment of big things.";
print $1."\n" while ($aQuote =~ /big ([a-z]+)/ig);
Edit: In R, we can write:
aQuote = "The big boat has a big assortment of big things."
theMatches = regmatches(aQuote, gregexpr("big ([a-z]+)", aQuote, ignore.case = TRUE))
results = lapply(theMatches, function(m) {
  len = length(m)
  for (i in 1:len) {
    print(m[[i]])
  }
})