Exercise: given a string with a name, then space or newline, then email, then maybe newline and some text separated by newlines capture the name and the domain of email.
So I created the following:
val regexp = "^([a-zA-Z]+)(?:\\s|\\n)\\w+#(\\w+\\.\\w+)(?:.|\\r|\\n)*".r
def fun(str: String): String = {
val result = str match {
case regexp(name, domain) => name + ' ' + domain
case _ => "invalid"
}
result
}
And started testing:
scala> val input = "oleg oleg#email.com"
scala> fun(input)
res17: String = oleg email.com
scala> val input = "oleg\noleg#email.com"
scala> fun(input)
res18: String = oleg email.com
scala> val input = """oleg
| oleg#email.com
| 7bdaf0a1be3"""
scala> fun(input)
res19: String = oleg email.com
scala> val input = """oleg
| oleg#email.com
| 7bdaf0a1be3
| """
scala> fun(input)
res20: String = invalid
Why doesn't the regexp capture the string with the newline at the end?
This part (?:\\s|\\n) can be shortened to \s as it will also match a newline, and as there is still a space before the emails where you are using multiple lines it can be \s+ to repeat it 1 or more times.
Matching any character like this (?:.|\\r|\\n)* if very inefficient due to the alternation. You can use either [\S\s]* or use an inline modifier (?s) to make the dot match a newline.
But using your pattern to just get the name and the domain of the email you don't have to match what comes after it, as you are using the 2 capturing groups in the output.
^([a-zA-Z]+)\s+\w+#(\w+\.\w+)
Regex demo
If you do want to match all that follows, you can use:
val regexp = """(?s)^([a-zA-Z]+)\s+\w+#(\w+\.\w+).*""".r
def fun(str: String): String = {
val result = str match {
case regexp(name, domain) => name + ' ' + domain
case _ => "invalid"
}
result
}
Scala demo
Note that this pattern \w+#(\w+\.\w+) is very limited for matching an email
Is there any simple way to do data masking in scala, can anyone please explain. I want to dynamically change the matching patterns to X with same keyword lengths
Example:
patterns to mask:
Narendra\s*Modi
Trump
JUN-\d\d
Input:
Narendra Modi pm of india 2020-JUN-03
Donald Trump president of USA
Ouput:
XXXXXXXX XXXX pm of india 2020-XXX-XX
Donald XXXXX president of USA
Note:Only characters should be masked, i want to retain space or hyphen in output for matching patterns
So you have an input String:
val input =
"Narendra Modi of India, 2020-JUN-03, Donald Trump of USA."
Masking off a given target with a given length is trivial.
input.replaceAllLiterally("abc", "XXX")
If you have many such targets of different lengths then it becomes more interesting.
"India|USA".r.replaceAllIn(input, "X" * _.matched.length)
//res0: String = Narendra Modi of XXXXX, 2020-JUN-03, Donald Trump of XXX.
If you have a mix of masked characters and kept characters, multiple targets can still be grouped together, but they must have the same number of sub-groups and the same pattern of masked-group to kept-group.
In this case the pattern is (mask)(keep)(mask).
raw"(Narendra)(\s+)(Modi)|(Donald)(\s+)(Trump)|(JUN)([-/])(\d+)".r
.replaceAllIn(input,{m =>
val List(a,b,c) = m.subgroups.flatMap(Option(_))
"X"*a.length + b + "X"*c.length
})
//res1: String = XXXXXXXX XXXX of India, 2020-XXX-XX, XXXXXX XXXXX of USA.
Something like that?
val pattern = Seq("Modi", "Trump", "JUN")
val str = "Narendra Modi pm of india 2020-JUN-03 Donald Trump president of USA"
def mask(pattern: Seq[String], str: String): String = {
var s = str
for (elem <- pattern) {
s = s.replaceAll(elem,elem.toCharArray.map(s=>"X").mkString)
}
s
}
print(mask(pattern,str))
out:
Narendra XXXX pm of india 2020-XXX-03 Donald XXXXX president of USA
scala> val pattern = Seq("Narendra\\s*Modi", "Trump", "JUN-\\d\\d", "Trump", "JUN")
pattern: Seq[String] = List(Narendra\s*Modi, Trump, JUN-\d\d, Trump, JUN)
scala> print(mask(pattern,str))
XXXXXXXXXXXXXXX pm of india 2020-XXXXXXXX Donald XXXXX president of USA
Yeah, It should work, try like above.
Please find the regex and code explanation inline
import org.apache.spark.sql.functions._
object RegExMasking {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
import spark.implicits._
//Regex to fetch the word
val regEx : String = """(\s+[A-Z|a-z]+\s)""".stripMargin
//load your Dataframe
val df = List("Narendra Modi pm of india 2020-JUN-03",
"Donald Trump president of USA ").toDF("sentence")
df.withColumn("valueToReplace",
//Fetch the 1st word from the regex parse expression
regexp_extract(col("sentence"),regEx,0)
)
.map(row => {
val sentence = row.getString(0)
//Trim for extra spaces
val valueToReplace : String = row.getString(1).trim
//Create masked string of equal length
val replaceWith = List.fill(valueToReplace.length)("X").mkString
// Return sentence , masked sentence
(sentence,sentence.replace(valueToReplace,replaceWith))
}).toDF("sentence","maskedSentence")
.show()
}
}
Given a string like abab/docId/example-doc1-2019-01-01, I want to use Regex to extract these values:
firstPart = example
fullString = example-doc1-2019-01-01
I have this:
import scala.util.matching.Regex
case class Read(theString: String) {
val stringFormat: Regex = """.*\/docId\/([A-Za-z0-9]+)-([A-Za-z0-9-]+)$""".r
val stringFormat(firstPart, fullString) = theString
}
But this separates it like this:
firstPart = example
fullString = doc1-2019-01-01
Is there a way to retain the fullString and do a regex on that to get the part before the first hyphen? I know I can do this using the String split method but is there a way do it using regex?
You may use
val stringFormat: Regex = ".*/docId/(([A-Za-z0-9])+-[A-Za-z0-9-]+)$".r
||_ Group 2 _| |
| |
|_________________ Group 1 __|
See the regex demo.
Note how capturing parentheses are re-arranged. Also, you need to swap the variables in the regex match call, see demo below (fullString should come before firstPart).
See Scala demo:
val theString = "abab/docId/example-doc1-2019-01-01"
val stringFormat = ".*/docId/(([A-Za-z0-9]+)-[A-Za-z0-9-]+)".r
val stringFormat(fullString, firstPart) = theString
println(s"firstPart: '$firstPart'\nfullString: '$fullString'")
Output:
firstPart: 'example'
fullString: 'example-doc1-2019-01-01'
Is there a way to return the first instance of an unmatched string between 2 strings with Scala's Regex library?
For example:
val a = "some text abc123 some more text"
val b = "some text xyz some more text"
a.firstUnmatched(b) = "abc123"
Regex is good for matching & replacing in strings based on patterns.
But to look for the differences between strings? Not exactly.
However, diff can be used to find differences.
object Main extends App {
val a = "some text abc123 some more text 321abc"
val b = "some text xyz some more text zyx"
val firstdiff = (a.split(" ") diff b.split(" "))(0)
println(firstdiff)
}
prints "abc123"
Is regex desired after all? Then realize that the splits could be replaced by regex matching.
The regex pattern in this example looks for words:
val reg = "\\w+".r
val firstdiff = (reg.findAllIn(a).toList diff reg.findAllIn(b).toList)(0)
I have a string like this
result: String = /home/administrator/com.supai.common-api-1.8.5-DEV- SNAPPSHOT/com/a/infra/UserAccountDetailsMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV- SNAPSHOT/com/a/infra/UserAccountDetailsMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserAccountMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserAccountMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenFunctionMetaDataMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenFunctionMetaDataMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenPermissionMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserOverridenPermissionMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserRoleMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/UserRoleMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV- SNAPSHOT/com/a/infra/VendorAddressMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/VendorAddressMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/reactore/infra/VendorContactMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/reactore/infra/VendorContactMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/reactore/infra/VendorMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/VendorMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WeekMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WeekMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowMetadataMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowMetadataMetaData.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowNotificationMetaData$.class
/home/administrator/com.supai.common-api-1.8.5-DEV-SNAPSHOT/com/a/infra/WorkflowNotificationMetaData.class
/home/a/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/a/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/home/common/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/raghav/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/sysadmin/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/tmp/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
regex: scala.util.matching.Regex = (\\/([u|s|r])\\/([s|h|a|r|e]))
x: scala.util.matching.Regex.MatchIterator = empty iterator`
and out of this how can I get only this part /usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jarand this part can be anywhere in the string, how can I achieve this, I tried using regular expression in Scala but don't know how to use forward slashes, so anybody plz explain how to do this in scala.
What is your search criteria? Your pattern seems to be wrong.
In your rexexp, I see u|s|r which means to search for either u, or s or r . See here for more information
how can I get only this part
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jarand
this part can be anywhere in the string
If you are looking for a path, see the below example:
scala> val input = """/home/common/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/raghav/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/sysadmin/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/tmp/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
| /home/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
| /usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar"""
input: String =
/home/common/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/raghav/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/sysadmin/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/tmp/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.3-SNAPSHOT.jar
/home/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
scala> val myRegExp = "/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar".r
myRegExp: scala.util.matching.Regex = /usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
scala> val myRegExp2 = "helloWorld.jar".r
myRegExp2: scala.util.matching.Regex = helloWorld.jar
scala> (myRegExp findAllIn input) foreach( println)
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
/usr/share/common-api/lib/com.supai.common-api-1.8.5-DEV-SNAPSHOT.jar
scala> (myRegExp2 findAllIn input) foreach( println)
scala>