regex of json string in data frame using spark scala - regex

I am having trouble retrieving a value from a JSON string using regex in spark.
My pattern is:
val st1 = """id":"(.*?)"""
val pattern = s"${'"'}$st1${'"'}"
//pattern is: "id":"(.*?)"
My test string in a DF is
import spark.implicits._
val jsonStr = """{"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"}"""
val df = sqlContext.sparkContext.parallelize(Seq(jsonStr)).toDF("request")
I am then trying to parse out the id value and add it to the df through a UDF like so:
def getSubStringGroup(pattern: String) = udf((request: String) => {
val patternWithResponseRegex = pattern.r
var subString = request match {
case patternWithResponseRegex(idextracted) => Array(idextracted)
case _ => Array("na")
}
subString
})
val dfWithIdExtracted = df.select($"request")
.withColumn("patternMatchGroups", getSubStringGroup(pattern)($"request"))
.withColumn("idextracted", $"patternMatchGroups".getItem(0))
.drop("patternMatchGroups")
So I want my df to look like
|------------------------------------------------------------- | ------------------------|
| request | id |
|------------------------------------------------------------- | ------------------------|
|{"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"} | 1d5482864c60d5bd07919490|
| -------------------------------------------------------------|-------------------------|
However, when I try the above method, my match comes back as "null" despite working on regex101.com
Could anyone advise or suggest a different method? Thank you.
Following Krzysztof's solution, my table now looks like so:
|------------------------------------------------------------- | ------------------------|
| request | id |
|------------------------------------------------------------- | ------------------------|
|{"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"} | "id":"1d5482864c60d5bd07919490"|
| -------------------------------------------------------------|-------------------------|
I created a new udf to trim the unnecessary characters and added it to the df:
def trimId = udf((idextracted: String) => {
val id = idextracted.drop(6).dropRight(1)
id
})
val dfWithIdExtracted = df.select($"request")
.withColumn("patternMatchGroups", getSubStringGroup(pattern)($"request"))
.withColumn("idextracted", $"patternMatchGroups".getItem(0))
.withColumn("id", trimId($"idextracted"))
.drop("patternMatchGroups", "idextracted")
The df now looks as desired. Thanks again Krzysztof!

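When you use a Regex as an extractor in a pattern match, it has to match the whole string, which can't succeed here because the pattern covers only part of the JSON. Use findFirstIn instead (note that it returns the entire matched text, not just the capture group):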
def getSubStringGroup(pattern: String) = udf((request: String) => {
val patternWithResponseRegex = pattern.r
patternWithResponseRegex.findFirstIn(request).map(Array(_)).getOrElse(Array("na"))
})
You're also building your pattern in a roundabout way, unless you have a special use case for it. You could just write:
val pattern = """"id":"(.*?)""""
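If you want only the captured id rather than the whole match, one option is findFirstMatchIn with group(1), which avoids the extra trimming step. This is a minimal sketch in plain Scala, without Spark; the names IdExtractor and extractId are made up for illustration, and the function body could be wrapped in udf(...) the same way as above.

```scala
import scala.util.matching.Regex

object IdExtractor {
  // Same pattern as above: the capture group holds the value of "id".
  val idPattern: Regex = """"id":"(.*?)"""".r

  // Returns the first captured id, or "na" when the pattern is not found.
  def extractId(request: String): String =
    idPattern.findFirstMatchIn(request).map(_.group(1)).getOrElse("na")
}
```

For the sample request this should give 1d5482864c60d5bd07919490, because "identifier" is not followed by ":" right after "id.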

Related

Scala regex get string before the first hyphen and the entire string

Given a string like abab/docId/example-doc1-2019-01-01, I want to use Regex to extract these values:
firstPart = example
fullString = example-doc1-2019-01-01
I have this:
import scala.util.matching.Regex
case class Read(theString: String) {
val stringFormat: Regex = """.*\/docId\/([A-Za-z0-9]+)-([A-Za-z0-9-]+)$""".r
val stringFormat(firstPart, fullString) = theString
}
But this separates it like this:
firstPart = example
fullString = doc1-2019-01-01
Is there a way to retain the fullString and apply a regex to it to get the part before the first hyphen? I know I can do this with the String split method, but is there a way to do it using regex?
You may use
val stringFormat: Regex = ".*/docId/(([A-Za-z0-9]+)-[A-Za-z0-9-]+)$".r
Here Group 1 (the outer parentheses) captures the full string, and Group 2 (the inner parentheses) captures the part before the first hyphen.
See the regex demo.
Note how capturing parentheses are re-arranged. Also, you need to swap the variables in the regex match call, see demo below (fullString should come before firstPart).
See Scala demo:
val theString = "abab/docId/example-doc1-2019-01-01"
val stringFormat = ".*/docId/(([A-Za-z0-9]+)-[A-Za-z0-9-]+)".r
val stringFormat(fullString, firstPart) = theString
println(s"firstPart: '$firstPart'\nfullString: '$fullString'")
Output:
firstPart: 'example'
fullString: 'example-doc1-2019-01-01'
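Note that a val pattern definition like the one in the demo throws a MatchError when the string does not match. A minimal sketch of a safer variant, assuming you would rather get an Option back; DocId and parse are hypothetical names, not part of the original answer:

```scala
object DocId {
  // Group 1 (outer) is the full string, group 2 (inner) is the part before the first hyphen.
  private val stringFormat = """.*/docId/(([A-Za-z0-9]+)-[A-Za-z0-9-]+)""".r

  // Returns Some((firstPart, fullString)) on a match, None otherwise.
  def parse(s: String): Option[(String, String)] = s match {
    case stringFormat(fullString, firstPart) => Some((firstPart, fullString))
    case _                                   => None
  }
}
```

Calling DocId.parse("abab/docId/example-doc1-2019-01-01") should return Some((example,example-doc1-2019-01-01)).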

Extract words from a string column in spark dataframe

I have a column in spark dataframe which has text.
I want to extract all the words which start with a special character '#' and I am using regexp_extract from each row in that text column. If the text contains multiple words starting with '#' it just returns the first one.
I am looking for extracting multiple words which match my pattern in Spark.
data_frame.withColumn("Names", regexp_extract($"text", "(?<=^|(?<=[^a-zA-Z0-9-_\\.]))#([A-Za-z]+[A-Za-z0-9_]+)", 1)).show
Sample input: #always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking
Sample output: #always_nidhi,#YouTube
You can create a udf function in Spark as below:
import java.util.regex.Pattern
import org.apache.spark.sql.functions.{col, lit, udf}
// Collect every match of `exp` in `text` and join the matches with commas.
def regexp_extractAll = udf((text: String, exp: String, groupIdx: Int) => {
  val m = Pattern.compile(exp).matcher(text)
  var result = Seq[String]()
  while (m.find) {
    result = result :+ m.group(groupIdx)
  }
  result.mkString(",")
})
And then call the udf as below:
data_frame.withColumn("Names", regexp_extractAll(col("text"), lit("#\\w+"), lit(0))).show()
The above gives you the output below:
+--------------------+
| Names|
+--------------------+
|#always_nidhi,#Yo...|
+--------------------+
I used a regex that fits the output you posted in the question. You can modify it to suit your needs.
You can use Java regex to extract those words. Below is the working code.
import org.apache.spark.{SparkConf, SparkContext}
val sparkConf = new SparkConf().setAppName("myapp").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.functions.{col, udf}
import java.util.regex.Pattern
//User Defined function to extract
def toExtract(str: String) = {
val pattern = Pattern.compile("#\\w+")
val tmplst = scala.collection.mutable.ListBuffer.empty[String]
val matcher = pattern.matcher(str)
while (matcher.find()) {
tmplst += matcher.group()
}
tmplst.mkString(",")
}
val Extract = udf(toExtract _)
val values = List("#always_nidhi #YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking")
val df = sc.parallelize(values).toDF("words")
df.select(Extract(col("words"))).show()
Output
+--------------------+
| UDF(words)|
+--------------------+
|#always_nidhi,#Yo...|
+--------------------+
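The matching loop in the UDFs above can also be written with scala.util.matching.Regex and findAllIn. This is only a sketch in plain Scala: HashtagExtractor and extractAll are hypothetical names, and the null check is an assumption about how you want to treat null column values (a UDF receives null for a null cell).

```scala
object HashtagExtractor {
  // Same pattern as in the UDFs above: '#' followed by one or more word characters.
  private val hashtag = """#\w+""".r

  // Joins all matches with commas; null or non-matching input yields an empty string.
  def extractAll(text: String): String =
    Option(text).map(t => hashtag.findAllIn(t).mkString(",")).getOrElse("")
}
```

It could then be registered with udf(HashtagExtractor.extractAll _) in the same way as toExtract.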
In Spark 3.1+ it's possible using regexp_extract_all
Test with your input:
import spark.implicits._
import org.apache.spark.sql.functions.{array_join, expr}
var df = Seq(
("#always_nidhi #YouTube no"),
("#always_nidhi"),
("no")
).toDF("text")
val col_re_list = expr("regexp_extract_all(text, '(?<=^|(?<=[^a-zA-Z0-9-_\\\\.]))#([A-Za-z]+[A-Za-z0-9_]+)', 0)")
df.withColumn("Names", array_join(col_re_list, ", ")).show(false)
// +-------------------------+-----------------------+
// |text |Names |
// +-------------------------+-----------------------+
// |#always_nidhi #YouTube no|#always_nidhi, #YouTube|
// |#always_nidhi |#always_nidhi |
// |no | |
// +-------------------------+-----------------------+
array_join is used because you wanted the result as a string, while regexp_extract_all returns an array.
If you use \ for escaping in your pattern, you will need to write \\\\ instead of \ until regexp_extract_all is available directly without expr.
I took the suggestion of Amit Kumar and created a UDF and then ran it in Spark SQL:
select Words(status) as people from dataframe
Words is my UDF and status is my dataframe column.

How to pull string value in url using scala regex?

I have the URLs below in my application and want to extract one of the values from them.
For example:
rapidView value 416
Input URL: http://localhost:8080/bladdey/shop/?rapidView=416&projectKey=DSCI&view=detail&
Output should be: 416
I've written the code in Scala using java.util.regex:
import java.util.regex.{Matcher, Pattern}
val p: Pattern = Pattern.compile("[?&]rapidView=(\\d+)[?&]")
val m:Matcher = p.matcher(url)
if(m.find())
println(m.group(1))
This works with java.util.regex, but I want to migrate it to the scala.util.matching library.
How can I implement this simply?
In Scala, you may use an unanchored regex within a match block to get just the captured part:
val s = "http://localhost:8080/bladdey/shop/?rapidView=416&projectKey=DSCI&view=detail&"
val pattern ="""[?&]rapidView=(\d+)""".r.unanchored
val res = s match {
case pattern(rapidView) => rapidView
case _ => ""
}
println(res)
// => 416
See the Scala demo
Details:
"""[?&]rapidView=(\d+)""".r.unanchored - the triple quoted string literal allows using single backslashes with regex escapes, and the .unanchored property makes the regex match partially, not the entire string
pattern(rapidView) gets the 1 or more digits part (captured with (\d+)) if a pattern finds a partial match
case _ => "" will return an empty string upon no match.
You can do this quite easily with Scala:
scala> val url = "http://localhost:8080/bladdey/shop/?rapidView=416&projectKey=DSCI&view=detail&"
url: String = http://localhost:8080/bladdey/shop/?rapidView=416&projectKey=DSCI&view=detail&
scala> url.split("rapidView=").tail.head.split("&").head
res0: String = 416
You can also extend it by parameterizing the search word:
scala> def searchParam(sp: String) = sp + "="
searchParam: (sp: String)String
scala> val sw = "rapidView"
sw: String = rapidView
And just search with the parameter name
scala> url.split(searchParam(sw)).tail.head.split("&").head
res1: String = 416
scala> val sw2 = "projectKey"
sw2: String = projectKey
scala> url.split(searchParam(sw2)).tail.head.split("&").head
res2: String = DSCI
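One caveat with the split approach: .tail.head throws a NoSuchElementException when the parameter is missing from the URL. A sketch of a regex-based alternative that returns an Option instead; QueryParam and find are hypothetical names, and the assumption is that a value ends at the next & or #:

```scala
import scala.util.matching.Regex

object QueryParam {
  // Looks up a query parameter by name; Regex.quote keeps the name literal.
  def find(url: String, name: String): Option[String] = {
    val pattern = ("[?&]" + Regex.quote(name) + "=([^&#]*)").r
    pattern.findFirstMatchIn(url).map(_.group(1))
  }
}
```

With the URL above, QueryParam.find(url, "rapidView") should give Some(416), and a name that is not present gives None.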

regular expression matching string in scala

I have a string like this
result: String = /home/administrator/com.supai.common-api-1.8.5-DEV- SNAPPSHOT/com/a/infra/UserAccountDetailsMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV- SNAPSHOT/com/a/infra/UserAccountDetailsMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserAccountMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserAccountMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenFunctionMetaDataMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenFunctionMetaDataMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenPermissionMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenPermissionMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserRoleMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserRoleMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV- SNAPSHOT/com/a/infra/VendorAddressMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/VendorAddressMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/reactore/infra/VendorContactMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/reactore/infra/VendorContactMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/reactore/infra/VendorMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/VendorMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WeekMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WeekMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowMetadataMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowMetadataMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowNotificationMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowNotificationMetaData.class
/home/a/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/a/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/home/common/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/raghav/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/sysadmin/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/tmp/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
regex: scala.util.matching.Regex = (\\/([u|s|r])\\/([s|h|a|r|e]))
x: scala.util.matching.Regex.MatchIterator = empty iterator
Out of this, how can I get only the part /usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar? It can be anywhere in the string. I tried using a regular expression in Scala but don't know how to handle forward slashes, so could anybody please explain how to do this in Scala?
What are your search criteria? Your pattern seems to be wrong.
In your regexp I see [u|s|r], which is a character class: it matches a single character that is u, s, r, or a literal |. Forward slashes need no escaping in a Scala regex. See here for more information
how can I get only this part
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar and
this part can be anywhere in the string
If you are looking for a path, see the below example:
scala> val input = """/home/common/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/raghav/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/sysadmin/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/tmp/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
| /usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar"""
input: String =
/home/common/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/raghav/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/sysadmin/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/tmp/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
scala> val myRegExp = "/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar".r
myRegExp: scala.util.matching.Regex = /usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
scala> val myRegExp2 = "helloWorld.jar".r
myRegExp2: scala.util.matching.Regex = helloWorld.jar
scala> (myRegExp findAllIn input) foreach( println)
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
scala> (myRegExp2 findAllIn input) foreach( println)
scala>
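Two caveats with the regex above: the dots are unescaped, so each . matches any character, and findAllIn also reports the path where it appears inside a longer path (hence the two lines printed). A sketch, assuming you want only the lines that are exactly the given path; PathFinder and exactLines are hypothetical names:

```scala
import scala.util.matching.Regex

object PathFinder {
  // Returns the lines of `input` that are exactly `path`.
  // (?m) makes ^ and $ match at line boundaries; Regex.quote treats the path literally.
  def exactLines(input: String, path: String): List[String] = {
    val lineRegex = ("(?m)^" + Regex.quote(path) + "$").r
    lineRegex.findAllIn(input).toList
  }
}
```

Applied to the input above with the /usr/share/... path, this should return a single element, since the /home/usr/share/... line no longer counts as a match.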

Removing diacritics in Scala

The problem is trivial: given a string in some language, remove the diacritical marks. For example, taking "téléphone" produces the result "telephone".
In Java I can use such method:
public static String removeAccents(String str){
return Normalizer.normalize(str, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
and it works fine but in scala it doesn't... I tried the code as follows:
val str = Normalizer.normalize("téléphone",Normalizer.Form.NFD)
val exp = "\\p{InCombiningDiacriticalMarks}+".r
exp.replaceAllIn(str,"")
it doesn't work!
I think, I'm missing something in using Regex in Scala, so any help would be appreciated.
I came across this same issue using Normalizer. I found a solution in Apache Commons Lang: StringUtils.stripAccents, which removes diacritics from a String.
import org.apache.commons.lang3.StringUtils.stripAccents
val str = stripAccents("téléphone")
println(str)
This will yield "telephone". Hope this helps someone!
You can wrap stripAccents in a UDF to apply it to a DataFrame column:
import org.apache.commons.lang3.StringUtils.stripAccents
import org.apache.spark.sql.functions.{col, udf}
val spark = SparkBase.getSparkSession() // SparkBase appears to be a project-specific helper; any SparkSession works here
import spark.implicits._
val str = stripAccents("téléphone")
println(str)
val str2 = stripAccents("SERNAQUE ARGÜELLO NORMA ELIZABETH")
println(str2)
val sourceDS = Seq(("YÁBAR ARRIETA JENSON", 1), ("SERNAQUE ARGÜELLO NORMA ELIZABETH", 2)).toDF("text","num")
// UDF that strips accents from a string column
val check = udf((colValue: String) => stripAccents(colValue))
sourceDS.select(col("text"),check(col("text"))).show(false)
Output:
+---------------------------------+---------------------------------+
|text |UDF(text) |
+---------------------------------+---------------------------------+
|YÁBAR ARRIETA JENSON |YABAR ARRIETA JENSON |
|SERNAQUE ARGÜELLO NORMA ELIZABETH|SERNAQUE ARGUELLO NORMA ELIZABETH|
+---------------------------------+---------------------------------+
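For completeness, the Normalizer approach from the question can also work in Scala without Commons Lang. One possible cause of the original problem, assuming the result of replaceAllIn was simply discarded, is that strings are immutable and the call returns a new string. A minimal sketch; Diacritics and removeAccents are illustrative names:

```scala
import java.text.Normalizer

object Diacritics {
  // Matches the combining marks that NFD normalization splits off from base letters.
  private val combiningMarks = "\\p{InCombiningDiacriticalMarks}+".r

  // Decomposes the string (NFD) and removes the combining marks; the result must be used.
  def removeAccents(str: String): String =
    combiningMarks.replaceAllIn(Normalizer.normalize(str, Normalizer.Form.NFD), "")
}
```

Diacritics.removeAccents("téléphone") should then return "telephone".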