Dynamic String Masking in Scala - regex

Is there a simple way to do data masking in Scala? I want to replace whatever matches a set of patterns with X characters, keeping the length of each matched keyword, and the set of patterns can change dynamically.
Example:
patterns to mask:
Narendra\s*Modi
Trump
JUN-\d\d
Input:
Narendra Modi pm of india 2020-JUN-03
Donald Trump president of USA
Output:
XXXXXXXX XXXX pm of india 2020-XXX-XX
Donald XXXXX president of USA
Note: only the characters should be masked; I want to keep any space or hyphen inside a matched pattern in the output.

So you have an input String:
val input =
"Narendra Modi of India, 2020-JUN-03, Donald Trump of USA."
Masking off a given target with a given length is trivial.
input.replace("abc", "XXX") // replaceAllLiterally is deprecated since Scala 2.13
If you have many such targets of different lengths then it becomes more interesting.
"India|USA".r.replaceAllIn(input, "X" * _.matched.length)
//res0: String = Narendra Modi of XXXXX, 2020-JUN-03, Donald Trump of XXX.
If you have a mix of masked characters and kept characters, multiple targets can still be grouped together, but they must have the same number of sub-groups and the same pattern of masked-group to kept-group.
In this case the pattern is (mask)(keep)(mask).
raw"(Narendra)(\s+)(Modi)|(Donald)(\s+)(Trump)|(JUN)([-/])(\d+)".r
.replaceAllIn(input,{m =>
val List(a,b,c) = m.subgroups.flatMap(Option(_))
"X"*a.length + b + "X"*c.length
})
//res1: String = XXXXXXXX XXXX of India, 2020-XXX-XX, XXXXXX XXXXX of USA.

Something like that?
val pattern = Seq("Modi", "Trump", "JUN")
val str = "Narendra Modi pm of india 2020-JUN-03 Donald Trump president of USA"
def mask(pattern: Seq[String], str: String): String = {
var s = str
for (elem <- pattern) {
s = s.replaceAll(elem,elem.toCharArray.map(s=>"X").mkString)
}
s
}
print(mask(pattern,str))
out:
Narendra XXXX pm of india 2020-XXX-03 Donald XXXXX president of USA

scala> val pattern = Seq("Narendra\\s*Modi", "Trump", "JUN-\\d\\d", "Trump", "JUN")
pattern: Seq[String] = List(Narendra\s*Modi, Trump, JUN-\d\d, Trump, JUN)
scala> print(mask(pattern,str))
XXXXXXXXXXXXXXX pm of india 2020-XXXXXXXX Donald XXXXX president of USA
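Yes, it works for literal patterns like the ones above. With regex patterns, though, the mask takes the length of the pattern string rather than of the matched text, and spaces and hyphens inside the match are masked as well, as the second output shows.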
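For what the question actually asks (mask length taken from the matched text, spaces and hyphens kept), here is a minimal sketch of the idea. It is written in Python rather than Scala, and the helper name mask is made up for illustration: all patterns are joined into one alternation, and inside each match only letters and digits are replaced with X.

```python
import re

def mask(patterns, text):
    # Hypothetical helper: join all patterns into a single alternation
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    # Inside each match, replace letters and digits only, so that
    # spaces and hyphens survive and the length stays the same
    return combined.sub(lambda m: re.sub(r"[A-Za-z0-9]", "X", m.group(0)), text)

patterns = [r"Narendra\s*Modi", r"Trump", r"JUN-\d\d"]

print(mask(patterns, "Narendra Modi pm of india 2020-JUN-03"))
print(mask(patterns, "Donald Trump president of USA"))
```

Since the pattern list is just data, it can be swapped at runtime without touching the masking logic.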

Please find the regex and code explanation inline
import org.apache.spark.sql.functions._
object RegExMasking {
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
import spark.implicits._
//Regex to fetch the first word that has whitespace on both sides
val regEx : String = """(\s+[A-Za-z]+\s)"""
//load your Dataframe
val df = List("Narendra Modi pm of india 2020-JUN-03",
"Donald Trump president of USA ").toDF("sentence")
df.withColumn("valueToReplace",
//Fetch the 1st word from the regex parse expression
regexp_extract(col("sentence"),regEx,0)
)
.map(row => {
val sentence = row.getString(0)
//Trim for extra spaces
val valueToReplace : String = row.getString(1).trim
//Create masked string of equal length
val replaceWith = List.fill(valueToReplace.length)("X").mkString
// Return sentence , masked sentence
(sentence,sentence.replace(valueToReplace,replaceWith))
}).toDF("sentence","maskedSentence")
.show()
}
}

Scala regex on a whole column

I have the following pattern that I could parse using pandas in Python, but struggle with translating the code into Scala.
grade string_column
85 (str:ann smith,14)(str:frank chase,15)
86 (str:john foo,15)(str:al more,14)
In python I used:
df.set_index('grade')['string_column']\
.str.extractall(r'\((str:[^,]+),(\d+)\)')\
.droplevel(1)
with the output:
grade 0 1
85 str:ann smith 14
85 str:frank chase 15
86 str:john foo 15
86 str:al more 14
In Scala I tried to duplicate the approach, but it's failing:
import scala.util.matching.Regex
val pattern = new Regex("((str:[^,]+),(\d+)\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
println(pattern findAllIn(str)).mkString(","))
There are a few notes about the code:
The parentheses are unbalanced: the first ( is meant to be a literal character, so it should be escaped
In a regular (non-raw) string literal, the backslashes have to be doubled
In the println you don't need all the parentheses and the dot
findAllIn returns a MatchIterator, and iterating over it yields each matched string. Joining those matched strings with a comma will, in this case, give back nearly the same string, with a comma between the matches.
For example
import scala.util.matching.Regex
val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
println(pattern findAllIn str mkString ",")
Output
(str:ann smith,14),(str:frank chase,15)
But if you want to print out the group 1 and group 2 values, you can use findAllMatchIn that returns a collection of Regex Matches:
import scala.util.matching.Regex
val pattern = new Regex("\\((str:[^,]+),(\\d+)\\)")
val str = "(str:ann smith,14)(str:frank chase,15)"
pattern findAllMatchIn str foreach(m => {
println(m.group(1))
println(m.group(2))
}
)
Output
str:ann smith
14
str:frank chase
15
In Python, Series.str.extractall only returns captured substrings. In Scala, findAllIn returns the matched values if you do not query its matchData property that in its turn contains a subgroups property.
So, to get the captures only in Scala, you need to use
val pattern = """\((str:[^,()]+),(\d+)\)""".r
val str = "(str:ann smith,14)(str:frank chase,15)"
(pattern findAllIn str).matchData foreach {
m => println(m.subgroups.mkString(","))
}
Output:
str:ann smith,14
str:frank chase,15
Here, m.subgroups accesses all subgroups (captures) of each match (m).
Also, note you do not need to double backslashes in triple-quoted string literals. \((str:[^,()]+),(\d+)\) matches
\( - a ( char
(str:[^,()]+) - Group 1: str: and one or more chars other than ,, ( and )
, - a comma
(\d+) - Group 2: one or more digits
\) - a ) char.
If you just want to get all matches without captures, you can use
val pattern = """\((str:[^,]+),(\d+)\)""".r
println((pattern findAllIn str).matchData.mkString(","))
Output:
(str:ann smith,14),(str:frank chase,15)
See the online demo.
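For comparison with the pandas call in the question, the same distinction between captures and whole matches exists in plain Python. This is only a small sketch using the standard re module (no pandas); the variable names are illustrative.

```python
import re

pattern = re.compile(r"\((str:[^,()]+),(\d+)\)")
text = "(str:ann smith,14)(str:frank chase,15)"

# findall returns only the captured groups, one tuple per match
captures = pattern.findall(text)

# finditer yields match objects; group(0) is the whole matched text
whole_matches = [m.group(0) for m in pattern.finditer(text)]

print(captures)
print(",".join(whole_matches))
```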

Regex extract string based on String match

I have this data with messy addresses that contain a province, a district, and a ward, not always in the same order:
Name ADDRESS
Store1 453, Duy Tan, Phuong Nguyen Nghiem, Thanh pho Quang Ngai
Store2 13 DUNG SY THANH KHE, P. THANH KHE TAY
Store3 98 Phan Xich Long- P. 2
Store4 306 B4, NGUYENVAN LINH, Ward - 5
Store5 22, Ngo 421/16, Tran Duy Hung, To 42, Phuong Trung Hoa, Quan Cau Giay
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
//Replace each \ with \\ so that C# doesn't treat \ as escape character
//Pattern: Start of string, any integers, 0 or 1 letter, end of word
string sPattern = "^[0-9]+([A-Za-z]\\b)?";
string sString = Row.ADDRESS ?? ""; //Coalesce to empty string if NULL
//Find any matches of the pattern in the string
Match match = Regex.Match(sString, sPattern, RegexOptions.IgnoreCase);
//If a match is found
if (match.Success)
//Return the first match into the new
//HouseNumber field
Row.ward= match.Groups[0].Value;
else
//If not found, leave the HouseNumber blank
Row.ward= "";
}
}
I would like to modify my regex to return the data like this in the column ward (you can see the synonyms in my addresses: Phuong, P., Ward, etc.).
Name    ADDRESS                                                                ward
Store1  453, Duy Tan, Phuong Nguyen Nghiem, Quang Ngai                         Phuong Nguyen Nghiem
Store2  13 DUNG SY THANH KHE, P. THANH KHE TAY                                 Phuong THANH KHE TAY
Store3  98 Phan Xich Long- P. 2                                                Phuong 2
Store4  306 B4, NGUYENVAN LINH, Ward - 5                                       Phuong 5
Store5  22, Ngo 421/16, Tran Duy Hung, To 42, Phuong Trung Hoa, Quan Cau Giay  Phuong Trung Hoa
I use that regular expression to extract the civic number, but is there a way to modify the regex so that it returns the data in my ward column as in the example above?
The groups in this regex, as tested on https://regex101.com/, match the data in your ward column for your examples. It only matches the patterns as they appear in your sample data, so you may need to define more precisely where each pattern can appear, but it may be enough for you to extrapolate to the regex you really need.
(Phuong.*),|P\.(.*$)|Ward - (.*$)
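The group in option 1 matches from Phuong (inclusive) up to the last comma after it, because .* is greedy. In your sample rows only one comma follows Phuong, so the result is the same; use [^,]* instead of .* if it must stop at the first comma.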
The group in option 2 matches anything that comes after P. until the end of the string.
The group in option 3 matches anything that comes after Ward - until the end of the string.
This one is a bit more advanced, but it only matches what you mentioned in your examples, no groups:
Phuong.*(?=,)|(?<=P\.).*$|(?<=Ward - ).*$
Test it in https://regex101.com to see how it works and to see what each part means.
Finally, you may want to exclude Phuong from the match in option 1 so that your program can always print Phuong followed by the match.
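The idea of capturing only the part after the ward keyword and always printing Phuong in front of it could look like the sketch below. It is written in Python rather than C#, it assumes the only synonyms are Phuong, P. and Ward - (as in the sample rows), and the names WARD_RE and extract_ward are made up for illustration.

```python
import re

# Keyword (Phuong, P. or "Ward -"), optional spaces, then everything up to the next comma
WARD_RE = re.compile(r"\b(?:Phuong|P\.|Ward\s*-)\s*([^,]+)", re.IGNORECASE)

def extract_ward(address):
    # Hypothetical helper: returns "" when no ward keyword is found
    m = WARD_RE.search(address or "")
    return "Phuong " + m.group(1).strip() if m else ""

addresses = [
    "453, Duy Tan, Phuong Nguyen Nghiem, Thanh pho Quang Ngai",
    "13 DUNG SY THANH KHE, P. THANH KHE TAY",
    "98 Phan Xich Long- P. 2",
    "306 B4, NGUYENVAN LINH, Ward - 5",
    "22, Ngo 421/16, Tran Duy Hung, To 42, Phuong Trung Hoa, Quan Cau Giay",
]
for a in addresses:
    print(extract_ward(a))
```

Real address data will almost certainly contain variants this pattern does not cover, so treat it as a starting point to test on regex101 rather than a finished rule.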

How to regex match everything but long words?

I would like to select all long words from a string: re.findall("[a-z]{3,}")
However, for some reason I can only use substitution. Hence I need to replace everything except words of three or more letters with a space (e.g. abc de1 fgh ij -> abc fgh).
What would such a regex look like?
The result should be all "[a-z]{3,}" concatenated by spaces. However, you can use substitution only.
Or in Python: Find a regex such that
re.sub(regex, " ", text) == " ".join(re.findall("[a-z]{3,}", text))
Here are some test cases:
import re
solution_regex="..."
for test_str in ["aaa aa aaa aa",
                 "aaa aa11",
                 "11aaa11 11aa11",
                 "aa aa1aa aaaa"
                 ]:
    expected_str = " ".join(re.findall("[a-z]{3,}", test_str))
    print(test_str, "->", expected_str)
    if re.sub(solution_regex, " ", test_str)!=expected_str:
        print("ERROR")
->
aaa aa aaa aa -> aaa aaa
aaa aa11 -> aaa
11aaa11 11aa11 -> aaa
aa aa1aa aaaa -> aaaa
Note that space is no different than any other symbol.
\b(?:[a-zA-Z_]{1,2}|\w*\d+\w*)\b
Explanation:
\b means that the substring you are looking for starts and ends at a word boundary
(?: ) - non-capturing group
[a-zA-Z_]{1,2} - any word of one or two letters or underscores
\w*\d+\w* - any word that contains at least one digit and consists of digits, '_' and letters
You can use the regex
(\s\b(\d*[a-z]\d*){1,2}\b)|(\s\b\d+\b)
and replace each match with an empty string. Here is Python code that does this:
import re
regex = r"(\s\b(\d*[a-z]\d*){1,2}\b)|(\s\b\d+\b)"
test_str = "abcd abc ad1r ab a11b a1 11a 1111 1111abcd a1b2c3d"
subst = ""
# The count argument limits the number of replacements; 0 means replace all
result = re.sub(regex, subst, test_str, count=0)
if result:
    print(result)
In AutoIt this works for me
#include <Array.au3>
$a = StringRegExp('abc de1 fgh ij 234234324 sdfsdfsdf wfwfwe', '(?i)[a-z]{3,}', 3)
ConsoleWrite(_ArrayToString($a, ' ') & @CRLF)
Result ==> abc fgh sdfsdfsdf wfwfwe
import re
regex = r"(?:^|\s)[^a-z\s]*[a-z]{0,2}[^a-z\s]*(?:\s|$)"
str = "abc de1 fgh ij"
subst = " "
result = re.sub(regex, subst, str)
print (result)
Output:
abc fgh
Explanation:
(?:^|\s) : non capture group, start of string or space
[^a-z\s]* : 0 or more any character that is not letter or space
[a-z]{0,2} : 0, 1 or 2 letters
[^a-z\s]* : 0 or more any character that is not letter or space
(?:\s|$) : non capture group, space or end of string
With the other ideas posted here, I came up with an answer. I can't believe I missed that:
([^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+
https://regex101.com/r/IIxkki/2
Match either non-letters, or up to two letters bounded by non-letters.
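Here is a quick way to check this regex against the test cases from the question. Because every replaced run becomes a single space, junk at the start or end of the input leaves a leading or trailing space, so this sketch strips the result before comparing. The helper name keep_long_words is an assumption for illustration, not part of the answer.

```python
import re

# Runs of non-letters and of one- or two-letter words, merged into one match
SHORT_OR_NON_LETTER = r"([^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+"

def keep_long_words(text):
    # Replace each run with one space, then drop the space left at either end
    return re.sub(SHORT_OR_NON_LETTER, " ", text).strip()

for test_str in ["aaa aa aaa aa",
                 "aaa aa11",
                 "11aaa11 11aa11",
                 "aa aa1aa aaaa"]:
    expected_str = " ".join(re.findall("[a-z]{3,}", test_str))
    result = keep_long_words(test_str)
    print(repr(test_str), "->", repr(result), "OK" if result == expected_str else "ERROR")
```

The trailing + on the outer group is what merges adjacent short pieces (for example " aa1aa ") into a single replacement, so only one space appears between the surviving words.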

Excel Macro Unable to Separate String Address

Software: MS Excel 2016
Update 1
Please note there can be any number of digits before West, i.e.
123124234234West18th Street
2West 14th Avenue
12324West
Please assist with general solution
Original Question
There is an address, 31West 52nd Street. I am trying to split the 31 and West so that the output will be
31 West 52nd Street
I tried this macro statement but it doesn't work. Please guide me:
Selection.Replace What:="?#West ", Replacement:=" West " _
, LookAt:=xlPart, SearchOrder:=xlByRows, MatchCase:=False, SearchFormat _
:=False, ReplaceFormat:=False
This is a sample of code that checks the first few characters. If they are digits, it separates them from the rest with a space:
Option Explicit
Public Sub TestMe()
Debug.Print fnStrStripMyNumber("31West 52nd Street")
Debug.Print fnStrStripMyNumber("123Vityata Shampion")
End Sub
Public Function fnStrStripMyNumber(strStr As String) As String
Dim lngCountDigits As Long
Dim lngCounter As Long
strStr = Trim(strStr)
For lngCounter = 1 To Len(strStr)
If IsNumeric(Mid(strStr, lngCounter, 1)) Then
lngCountDigits = lngCountDigits + 1
Else
Exit For
End If
Next lngCounter
strStr = Left(strStr, lngCountDigits) & " " & Right(strStr, Len(strStr) - lngCountDigits)
fnStrStripMyNumber = Trim(strStr)
End Function
Thus, from input:
"31West 52nd Street"
"123Vityata Shampion"
We get output:
31 West 52nd Street
123 Vityata Shampion
You can try this excel formula as well,
=LEFT(A1,FIND("West",A1)-1)&" "&RIGHT(A1,LEN(A1)-FIND("West",A1)+1)
Or if you want a macro only,
Sub rep()
Range("B1") = Replace(Range("A1"), "West", " West")
End Sub
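For the general case in the update (any number of digits before the street name), the same split can also be expressed as a single regex substitution. This is a minimal sketch in Python, not VBA, and the function name split_leading_number is hypothetical.

```python
import re

def split_leading_number(address):
    # Insert one space after a leading run of digits that is
    # directly followed by a letter; anything else is left as is
    return re.sub(r"^(\d+)(?=[A-Za-z])", r"\1 ", address.strip())

for s in ["123124234234West18th Street", "2West 14th Avenue", "12324West", "31West 52nd Street"]:
    print(split_leading_number(s))
```

Unlike the Replace on "West", this does not depend on the street name, and an address that already has the space is returned unchanged.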

Scala regular expression (xml parsing)

I'm parsing an xml file, that has nodes with text like this:
<img src="someUrl1"> American Dollar 1USD | 2,8567 | sometext
<img src="someUrl2"> Euro 1EUR | 3,9446 | sometext
<img src="someUrl3"> Japanese Jen 100JPY | 3,4885 | sometext
What I want to get is values like this:
American Dollar, USD, 2,8567
Euro, EUR, 3,9446
Japanese Jen, JPY, 3,4885
I wonder how could I write the regular expression for this. Scala has some weird regular expressions and I can't figure it out.
If I understand you correctly, you just want to use a regex to extract your information. In this case, you can use Scala's extractor functionality and do something like this:
scala> val RegexParser = """(.*) \d+([A-Z]+) \| (.*) \|.*""".r
RegexParser: scala.util.matching.Regex = (.*) \d+([A-Z]+) \| (.*) \|.*
scala> val RegexParser(name,shortname,value) = "American Dollar 1USD | 2,8567 | sometext"
name: String = American Dollar
shortname: String = USD
value: String = 2,8567
scala> val RegexParser(name,shortname,value) = "Euro 1EUR | 3,9446 | sometext"
name: String = Euro
shortname: String = EUR
value: String = 3,9446
scala> val RegexParser(name,shortname,value) = "Japanese Jen 100JPY | 3,4885 | sometext"
name: String = Japanese Jen
shortname: String = JPY
value: String = 3,4885
First, you create an Extractor based on a Regex-String. This can be done by calling r on a String (class StringOps to be exact). After that you can use this Extractor to read out all matched elements (name, shortname, value). In this blog post you will find a good explanation.