Removing diacritics in Scala - regex

The problem is simple: take a string in some language and remove its diacritical marks. For example, "téléphone" should become "telephone".
In Java I can use a method like this:
public static String removeAccents(String str) {
    return Normalizer.normalize(str, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
It works fine, but in Scala it doesn't. I tried the following code:
val str = Normalizer.normalize("téléphone",Normalizer.Form.NFD)
val exp = "\\p{InCombiningDiacriticalMarks}+".r
exp.replaceAllIn(str,"")
but it doesn't work!
I think I'm missing something about using Regex in Scala, so any help would be appreciated.

I came across the same issue using Normalizer and found a solution in Apache Commons Lang: StringUtils.stripAccents, which removes diacritics from a String.
import org.apache.commons.lang3.StringUtils.stripAccents
val str = stripAccents("téléphone")
println(str)
This will yield "telephone". Hope this helps someone!
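For anyone who would rather stay with the JDK, the approach from the question can also be made to work in Scala. The sketch below is a minimal version, assuming the original failure came from discarding the return value of replaceAllIn (strings are immutable, so the result has to be captured) or from the encoding of the source file. The object and method names are made up for illustration.

```scala
import java.text.Normalizer

// Hypothetical helper object; the names are illustrative, not from the original post.
object AccentStripper {
  // One or more characters from the Combining Diacritical Marks block.
  private val CombiningMarks = "\\p{InCombiningDiacriticalMarks}+".r

  def removeAccents(str: String): String = {
    // NFD splits each accented letter into a base letter plus combining marks.
    val decomposed = Normalizer.normalize(str, Normalizer.Form.NFD)
    // replaceAllIn returns a new string; it does not modify its argument.
    CombiningMarks.replaceAllIn(decomposed, "")
  }
}
```

Calling AccentStripper.removeAccents("t\u00e9l\u00e9phone") should give "telephone"; writing the input with Unicode escapes sidesteps any file-encoding problems.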

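You can also wrap stripAccents in a Spark UDF and apply it to a DataFrame column:
import org.apache.commons.lang3.StringUtils.stripAccents
import org.apache.spark.sql.functions.{col, udf}
// SparkBase is the poster's own helper that returns a SparkSession
val spark=SparkBase.getSparkSession()
val sc=spark.sparkContext
import spark.implicits._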
val str = stripAccents("téléphone")
println(str)
val str2 = stripAccents("SERNAQUE ARGÜELLO NORMA ELIZABETH")
println(str2)
case class Fruits(name: String, quantity: Int)
val sourceDS = Seq(("YÁBAR ARRIETA JENSON", 1), ("SERNAQUE ARGÜELLO NORMA ELIZABETH", 2)).toDF("text","num")
val check = udf((colValue: String) => {
  stripAccents(colValue)
})
sourceDS.select(col("text"),check(col("text"))).show(false)
Output:
+---------------------------------+---------------------------------+
|text                             |UDF(text)                        |
+---------------------------------+---------------------------------+
|YÁBAR ARRIETA JENSON             |YABAR ARRIETA JENSON             |
|SERNAQUE ARGÜELLO NORMA ELIZABETH|SERNAQUE ARGUELLO NORMA ELIZABETH|
+---------------------------------+---------------------------------+

Related

transform string scala in an elegant way

I have the following input string: val s = "19860803 000000"
I want to convert it to 1986/08/03.
I tried s.split(" ").head, but that only gets me part of the way.
Is there an elegant way in Scala, using a regex, to get the expected result?
You can use a date-like pattern with three capture groups, followed by a match for the space and the six trailing digits.
In the replacement, join the three groups with forward slashes.
val s = "19860803 000000"
val result = s.replaceAll("^(\\d{4})(\\d{2})(\\d{2})\\h\\d{6}$", "$1/$2/$3")
Output
result: String = 1986/08/03
I haven't tested this, but I think the following will work:
val expr = raw"(\d{4})(\d{2})(\d{2}) (.*)".r
val formatted = "19860803 000000" match {
  case expr(year,month,day,_) => s"$year/$month/$day"
}
The Scala docs have a lot of good info:
https://www.scala-lang.org/api/2.13.6/scala/util/matching/Regex.html
An alternative, without a regular expression, by using slice and take.
val s = "19860803 000000"
val year = s.take(4)
val month = s.slice(4,6)
val day = s.slice(6,8)
val result = s"$year/$month/$day"
Or as a one liner
val result = Seq(s.take(4), s.slice(4,6), s.slice(6,8)).mkString("/")
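If the input should also be validated as a real date, a different technique from the ones above is to parse it with java.time and format it again. This is only a sketch under that assumption; the object and method names are invented for illustration.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper; names are illustrative.
object DateReformat {
  private val InputFormat  = DateTimeFormatter.ofPattern("yyyyMMdd HHmmss")
  private val OutputFormat = DateTimeFormatter.ofPattern("yyyy/MM/dd")

  // Returns None when the input is not a valid date-time in the expected layout.
  def reformat(s: String): Option[String] =
    Try(LocalDateTime.parse(s, InputFormat)).toOption.map(_.format(OutputFormat))
}
```

DateReformat.reformat("19860803 000000") gives Some("1986/08/03"), while an impossible date such as month 13 gives None instead of a silently reformatted string.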

How to use match with regular expressions in Scala

I am starting to learn Scala and want to use regular expressions to match a character from a string so I can populate a mutable map of characters and their value (String values, numbers etc) and then print the result.
I have looked at several answers on SO and gone over the Scala Docs but can't seem to get this right. I have a short Lexer class that currently looks like this:
class Lexer {
  private val tokens: mutable.Map[String, Any] = collection.mutable.Map()
  private def checkCharacter(char: Character): Unit = {
    val Operator = "[-+*/^%=()]".r
    val Digit = "[\\d]".r
    val Other = "[^\\d][^-+*/^%=()]".r
    char.toString match {
      case Operator(c) => tokens(c) = "Operator"
      case Digit(c) => tokens(c) = Integer.parseInt(c)
      case Other(c) => tokens(c) = "Other" // Temp value, write function for this
    }
  }
  def lex(input: String): Unit = {
    val inputArray = input.toArray
    for (s <- inputArray)
      checkCharacter(s)
    for((key, value) <- tokens)
      println(key + ": " + value)
  }
}
I'm pretty confused by the somewhat strange method syntax, Operator(c), that I have seen used to handle the value to match, and I'm also unsure whether this is the correct way to use regex in Scala. I think what I want this code to do is clear, and I'd really appreciate some help understanding this. If more info is needed I will supply what I can.
The official docs have lots of examples: https://www.scala-lang.org/api/2.12.1/scala/util/matching/Regex.html. What might be confusing is the type of the regular expression and its use in pattern matching.
You can construct a regex from any string by using .r:
scala> val regex = "(something)".r
regex: scala.util.matching.Regex = (something)
Your regex becomes an object with a few useful methods for finding matches and groups, such as findAllIn.
In Scala it's idiomatic to use pattern matching for safe extraction of values, so the Regex class also has an unapplySeq method to support pattern matching. This makes it an extractor object. You can call it directly (not common):
scala> regex.unapplySeq("something")
res1: Option[List[String]] = Some(List(something))
or you can let Scala compiler call it for you when you do pattern matching:
scala> "something" match {
| case regex(x) => x
| case _ => ???
| }
res2: String = something
You might ask why unapply/unapplySeq has exactly this return type. The doc explains it very well:
The return type of an unapply should be chosen as follows:
If it is just a test, return a Boolean. For instance case even().
If it returns a single sub-value of type T, return an Option[T].
If you want to return several sub-values T1,...,Tn, group them in an optional tuple Option[(T1,...,Tn)].
Sometimes, the number of values to extract isn’t fixed and we would
like to return an arbitrary number of values, depending on the input.
For this use case, you can define extractors with an unapplySeq method
which returns an Option[Seq[T]]. Common examples of these patterns
include deconstructing a List using case List(x, y, z) => and
decomposing a String using a regular expression Regex, such as case
r(name, remainingFields @ _*) =>
In short, your regex might match one or more groups, so it needs to return a list/seq. That has to be wrapped in an Option to comply with the extractor contract.
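To make the extractor behaviour concrete, here is a small sketch. The object, the regex names and the sample inputs are all invented for illustration. It shows that the number of variables in the case pattern has to equal the number of capture groups, and that the regex has to match the whole string.

```scala
import scala.util.matching.Regex

// Hypothetical example object; names and patterns are illustrative only.
object ExtractorSketch {
  val KeyValue: Regex = "(\\w+)=(\\w+)".r // two capture groups
  val DigitNoGroup: Regex = "\\d".r       // no capture groups

  def describe(s: String): String = s match {
    case KeyValue(k, v) => s"key=$k, value=$v" // two groups, two variables
    case DigitNoGroup() => "a digit"           // zero groups, zero variables
    case _              => "no match"          // the whole string must match
  }
}
```

ExtractorSketch.KeyValue.unapplySeq("a=1") is Some(List("a", "1")), whereas DigitNoGroup.unapplySeq("5") is Some(Nil). That is why a pattern such as case Digit(c) cannot bind c unless the regex contains a capture group.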
The way you are using regex is correct, I would just map your function over the input array to avoid creating mutable maps. Perhaps something like this:
class Lexer {
  private def getCharacterType(char: Character): Any = {
    val Operator = "([-+*/^%=()])".r
    val Digit = "([\\d])".r
    //val Other = "[^\\d][^-+*/^%=()]".r
    char.toString match {
      case Operator(c) => "Operator"
      case Digit(c) => Integer.parseInt(c)
      case _ => "Other" // Temp value, write function for this
    }
  }
  def lex(input: String): Unit = {
    val inputArray = input.toArray
    val tokens = inputArray.map(x => x -> getCharacterType(x))
    for((key, value) <- tokens)
      println(key + ": " + value)
  }
}
scala> val l = new Lexer()
l: Lexer = Lexer@60f662bd
scala> l.lex("a-1")
a: Other
-: Operator
1: 1
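A possible next step, offered only as a sketch and not as part of the answer above, is to replace Any with a small sealed hierarchy so the compiler can check that every token type is handled. All type and object names below are hypothetical.

```scala
// Hypothetical token types; not from the original post.
sealed trait Token
final case class Operator(symbol: Char) extends Token
final case class Digit(value: Int) extends Token
final case class Other(char: Char) extends Token

object TypedLexer {
  // Each regex has exactly one capture group so it can bind one variable.
  private val OperatorRe = "([-+*/^%=()])".r
  private val DigitRe    = "(\\d)".r

  def classify(c: Char): Token = c.toString match {
    case OperatorRe(_) => Operator(c)
    case DigitRe(d)    => Digit(d.toInt)
    case _             => Other(c)
  }

  // Returns the tokens in input order instead of printing them.
  def lex(input: String): List[Token] = input.toList.map(classify)
}
```

With this version, TypedLexer.lex("a-1") returns List(Other('a'), Operator('-'), Digit(1)).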

regex of json string in data frame using spark scala

I am having trouble retrieving a value from a JSON string using regex in spark.
My pattern is:
val st1 = """id":"(.*?)"""
val pattern = s"${'"'}$st1${'"'}"
//pattern is: "id":"(.*?)"
My test string in a DF is
import spark.implicits._
val jsonStr = """{"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"}"""
val df = sqlContext.sparkContext.parallelize(Seq(jsonStr)).toDF("request")
I am then trying to parse out the id value and add it to the df through a UDF like so:
def getSubStringGroup(pattern: String) = udf((request: String) => {
  val patternWithResponseRegex = pattern.r
  var subString = request match {
    case patternWithResponseRegex(idextracted) => Array(idextracted)
    case _ => Array("na")
  }
  subString
})
val dfWithIdExtracted = df.select($"request")
  .withColumn("patternMatchGroups", getSubStringGroup(pattern)($"request"))
  .withColumn("idextracted", $"patternMatchGroups".getItem(0))
  .drop("patternMatchGroups")
So I want my df to look like
|------------------------------------------------------------- | ------------------------|
| request | id |
|------------------------------------------------------------- | ------------------------|
|{"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"} | 1d5482864c60d5bd07919490|
| -------------------------------------------------------------|-------------------------|
However, when I try the above method, my match comes back as "null" despite working on regex101.com
Could anyone advise or suggest a different method? Thank you.
Following Krzysztof's solution, my table now looks like so:
|------------------------------------------------------------- | ------------------------|
| request | id |
|------------------------------------------------------------- | ------------------------|
|{"type":"x","identifier":"y","id":"1d5482864c60d5bd07919490"} | "id":"1d5482864c60d5bd07919490"|
| -------------------------------------------------------------|-------------------------|
I created a new udf to trim the unnecessary characters and added it to the df:
def trimId = udf((idextracted: String) => {
  val id = idextracted.drop(6).dropRight(1)
  id
})
val dfWithIdExtracted = df.select($"request")
  .withColumn("patternMatchGroups", getSubStringGroup(pattern)($"request"))
  .withColumn("idextracted", $"patternMatchGroups".getItem(0))
  .withColumn("id", trimId($"idextracted"))
  .drop("patternMatchGroups", "idextracted")
The df now looks as desired. Thanks again Krzysztof!
When you pattern match against a regex, the whole string has to match the pattern, which can't succeed here because the pattern covers only part of the JSON. Use findFirstIn instead:
def getSubStringGroup(pattern: String) = udf((request: String) => {
  val patternWithResponseRegex = pattern.r
  patternWithResponseRegex.findFirstIn(request).map(Array(_)).getOrElse(Array("na"))
})
You're also building your pattern in a roundabout way. Unless you have a special use case for it, you could just write:
val pattern = """"id":"(.*?)""""
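Note that findFirstIn returns the whole matched text ("id":"..."), which is why the extra trimming step was needed in the follow-up. A variation, sketched here in plain Scala without Spark and with invented names, is to use findFirstMatchIn and read capture group 1 directly.

```scala
// Hypothetical helper; the object and method names are illustrative.
object IdExtractSketch {
  // Same pattern as above, written with escaped quotes: "id":"(.*?)"
  private val IdPattern = "\"id\":\"(.*?)\"".r

  // Returns the first captured id, or "na" when the pattern is absent.
  def extractId(request: String): String =
    IdPattern.findFirstMatchIn(request).map(_.group(1)).getOrElse("na")
}
```

For the sample request, IdExtractSketch.extractId returns 1d5482864c60d5bd07919490 with no further trimming. The body of extractId could be dropped into the UDF in place of the findFirstIn call. Spark's built-in regexp_extract function, which takes a group index, may also be worth a look.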

Parsing simple query syntax

Let's say I have a query string like this:
#some terms! "phrase query" in:"my container" in:group_3
or
#some terms!
or
in:"my container" in:group_3 terms! "phrase query"
or
in:"my container" test in:group_3 terms!
What is the best way to parse this correctly?
I've looked at Lucene's SimpleQueryParser, but it seems quite complicated for my use case. I've been trying to parse the query with regexes, without much success so far, mostly because whitespace is allowed inside quotes.
Any simple ideas?
I just need a list of elements as output; after that it's pretty easy for me to solve the rest of the problem:
[
"#some",
"terms!",
"phrase query",
"in:\"my container\"",
"in:group_3"
]
The following regex matches the text of your output:
(?:\S*"(?:[^"]+)"|\S+)
See the demo
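As a quick way to try that regex from Scala, here is a minimal sketch; the object and method names are made up. Note that a standalone quoted phrase keeps its surrounding quotes in the result, so they would still need stripping to get exactly the output listed in the question.

```scala
// Hypothetical wrapper around the regex above; names are illustrative.
object QueryTokenizer {
  // Either an optional non-space prefix followed by a quoted phrase,
  // or a plain run of non-space characters.
  private val Token = """(?:\S*"(?:[^"]+)"|\S+)""".r

  def tokenize(query: String): List[String] = Token.findAllIn(query).toList
}
```

For the first sample query this yields five tokens: #some, terms!, "phrase query" (with its quotes), in:"my container" and in:group_3.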
Just for those interested, here's the final Scala/Java parser I used to solve my problem, inspired by the answers to this question:
// QueryTerm is not shown in the post; from its usage it is assumed to be
// something like: case class QueryTerm(prefix: Option[String], text: String)
def testMatcher(query: String): Unit = {
  def optionalPrefix(groupName: String) = s"(?:(?:(?<$groupName>[a-zA-Z]+)[:])?)"
  val quoted = optionalPrefix("prefixQuoted") + "\"(?<textQuoted>[^\"]*)\""
  val unquoted = optionalPrefix("prefixUnquoted") + "(?<textUnquoted>[^\\s\"]+)"
  val regex = quoted + "|" + unquoted
  val matcher = regex.r.pattern.matcher(query)
  var results: List[QueryTerm] = Nil
  while (matcher.find()) {
    val quotedResult = Option(matcher.group("textQuoted")).map(textQuoted =>
      (Option(matcher.group("prefixQuoted")),textQuoted)
    )
    val unquotedResult = Option(matcher.group("textUnquoted")).map(textUnquoted =>
      (Option(matcher.group("prefixUnquoted")),textUnquoted)
    )
    val anyResult = quotedResult.orElse(unquotedResult).get
    results = QueryTerm(anyResult._1,anyResult._2) :: results
  }
  println(s"results=${results.mkString("\n")}")
}

regular expression matching string in scala

I have a string like this
result: String = /home/administrator/com.supai.common-api-1.8.5-DEV- SNAPPSHOT/com/a/infra/UserAccountDetailsMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV- SNAPSHOT/com/a/infra/UserAccountDetailsMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserAccountMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserAccountMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenFunctionMetaDataMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenFunctionMetaDataMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenPermissionMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenPermissionMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserRoleMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserRoleMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV- SNAPSHOT/com/a/infra/VendorAddressMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/VendorAddressMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/reactore/infra/VendorContactMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/reactore/infra/VendorContactMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/reactore/infra/VendorMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/VendorMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WeekMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WeekMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowMetadataMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowMetadataMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowNotificationMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowNotificationMetaData.class
/home/a/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/a/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/home/common/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/raghav/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/sysadmin/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/tmp/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
regex: scala.util.matching.Regex = (\\/([u|s|r])\\/([s|h|a|r|e]))
x: scala.util.matching.Regex.MatchIterator = empty iterator
Out of this, how can I get only the part /usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar? It can appear anywhere in the string. I tried using a regular expression in Scala, but I don't know how to handle forward slashes, so could anybody please explain how to do this in Scala?
What are your search criteria? Your pattern seems to be wrong.
In your regexp, I see [u|s|r], which is a character class that matches a single character: u, s, r, or a literal |. See here for more information
how can I get only this part
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar and
this part can be anywhere in the string
If you are looking for a path, see the below example:
scala> val input = """/home/common/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/raghav/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/sysadmin/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/tmp/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
| /usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar"""
input: String =
/home/common/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/raghav/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/sysadmin/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/tmp/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
scala> val myRegExp = "/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar".r
myRegExp: scala.util.matching.Regex = /usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
scala> val myRegExp2 = "helloWorld.jar".r
myRegExp2: scala.util.matching.Regex = helloWorld.jar
scala> (myRegExp findAllIn input) foreach( println)
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
scala> (myRegExp2 findAllIn input) foreach( println)
scala>
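Two caveats about the example above: forward slashes need no escaping in a Java/Scala regex, but an unescaped dot matches any character, and the search also finds the path inside /home/usr/share/..., which is why two lines are printed. The sketch below, with invented names and shortened placeholder paths, quotes the literal with Regex.quote and optionally anchors it to whole lines.

```scala
import scala.util.matching.Regex

// Hypothetical helper; names are illustrative.
object LiteralPathSearch {
  // Every occurrence of the literal text, wherever it appears.
  def findLiteral(needle: String, haystack: String): List[String] =
    Regex.quote(needle).r.findAllIn(haystack).toList

  // Only lines that consist of exactly the literal text;
  // (?m) makes ^ and $ match at line boundaries.
  def wholeLines(needle: String, haystack: String): List[String] =
    ("(?m)^" + Regex.quote(needle) + "$").r.findAllIn(haystack).toList
}
```

With a two-line input holding /home/usr/share/lib/a-1.8.5.jar and /usr/share/lib/a-1.8.5.jar, findLiteral for the second path returns two matches, while wholeLines returns only the one line that is exactly that path.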