Regex FindAll not printing results Kotlin - regex

I have a program that uses ML Kit text recognition on a document, and I want to print only the prices from the recognized text. I pass the recognized string through the regex below:
val reg = Regex("\$([0-9]*.[0-9]{2})")
val matches = reg.findAll(rec)
val prices = matches.map{it.groupValues[0]}.joinToString()
recogResult.text = prices
I have tested the regex on another website, and it matches all the right data. However, nothing is printed. After reg.findAll(rec) runs, matches is kotlin.sequences.GeneratorSequence@bd56ff3 and prices is "".

You can use
val reg = Regex("""\$[0-9]*\.[0-9]{2}""")
val matches = reg.findAll("Price: \$1234.56 and \$1.56")
val prices = matches.map{it.groupValues[0]}.joinToString()
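Notes:
In a regular string literal, "\$" is a Kotlin string escape that yields a bare $, so the original pattern starts with the end-of-input anchor $ and cannot match any price. The unescaped . in it also matches any character, not only a decimal point.
"""...""" is a triple-quoted (raw) string literal, in which backslashes are literal \ chars and do not form string escape sequences.
\$ - in a triple-quoted string literal, this is the regex escape \$, which matches a literal $ char.
[0-9]*\.[0-9]{2} - matches zero or more digits, a literal . and then two digits.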
Note that you may use \p{Sc} to match any currency chars, not just $.
If you want to make sure no other digit follows the two fractional digits, add (?![0-9]) at the end of your regex.
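For comparison, here is a minimal Java sketch of the same pattern, including the optional (?![0-9]) lookahead mentioned above. The class and method names (PriceFinder, findPrices) are made up for illustration. Java has no raw string literals for this purpose, so each regex backslash is doubled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answer.
class PriceFinder {
    // \$ matches a literal dollar sign; (?![0-9]) rejects a third fractional digit.
    private static final Pattern PRICE =
            Pattern.compile("\\$[0-9]*\\.[0-9]{2}(?![0-9])");

    static List<String> findPrices(String text) {
        List<String> prices = new ArrayList<>();
        Matcher m = PRICE.matcher(text);
        while (m.find()) {
            prices.add(m.group()); // whole match, same as groupValues[0] in Kotlin
        }
        return prices;
    }

    public static void main(String[] args) {
        System.out.println(findPrices("Price: $1234.56 and $1.56")); // [$1234.56, $1.56]
    }
}
```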

Related

Regex to replace all non numbers but allow a '+' prefix

I want to delete all invalid letters from a string which should represent a phone number. Only a '+' prefix and numbers are allowed.
I tried in Kotlin with
"+1234abc567+".replace("[^+0-9]".toRegex(), "")
It works nearly perfect, but it does not replace the last '+'.
How can I modify the regex to only allow the first '+'?
You could do a regex replacement on the following pattern:
(?<=.)\+|[^0-9+]+
Sample script:
String input = "+1234abc567+";
String output = input.replaceAll("(?<=.)\\+|[^0-9+]+", "");
System.out.println(input); // +1234abc567+
System.out.println(output); // +1234567
Here is an explanation of the regex pattern:
(?<=.)\+ match a literal + which is NOT first (i.e. preceded by >= 1 character)
| OR
[^0-9+]+ match one or more non digit characters, excluding +
You can use
^(\+)|\D+
Replace with the backreference to the first group, $1. See the regex demo.
Details:
^(\+) - a + at the start of string captured into Group 1
| - or
\D+ - one or more non-digit chars.
NOTE: a raw string literal delimited with """ allows the use of a single backslash to form regex escapes, such as \D, \d, etc. Using this type of string literals greatly simplifies regex definitions inside code.
See the Kotlin demo:
val s = "+1234abc567+"
val regex = """^(\+)|\D+""".toRegex()
println(s.replace(regex, "$1"))
// => +1234567
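The same replacement can be sketched in Java, where the regex backslashes have to be doubled inside the string literal. PhoneCleaner and clean are illustrative names only.

```java
// Hypothetical helper, not part of the original answers.
class PhoneCleaner {
    static String clean(String raw) {
        // Group 1 holds a leading '+'. For every other match the group is unset,
        // so "$1" expands to an empty string and the matched text is dropped.
        return raw.replaceAll("^(\\+)|\\D+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(clean("+1234abc567+")); // +1234567
    }
}
```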

Regex for masking data

I am trying to implement regex for a JSON Response on sensitive data.
JSON response comes with AccountNumber and AccountName.
Masking details are as below.
accountNumber Before: 7835673653678365
accountNumber Masked: 783567365367****
accountName Before : chris hemsworth
accountName Masked : chri* *********
I can match the above with [0-9]{12} and (?![0-9]{12}), but when I replace the match, it is replaced with a single *, so my regex does not produce the correct output.
How can I produce output as above from regex?
If all you want is to mask every character except the first N, I don't think you really need a complicated regex. To skip the first N characters and replace every character after them with *, you can write a generic regex like this,
(?<=.{N}).
where N can be any number like 1,2,3 etc. and replace the match with *
The way this regex works is, it selects every character which has at least N characters before it and hence once it selects a character, all following characters also get selected.
For example, in your AccountNumber case, N = 12, hence your regex becomes,
(?<=.{12}).
Java code,
String s = "7835673653678365";
System.out.println(s.replaceAll("(?<=.{12}).", "*"));
Prints,
783567365367****
And for AccountName case, N = 4, hence your regex becomes,
(?<=.{4}).
Java code,
String s = "chris hemsworth";
System.out.println(s.replaceAll("(?<=.{4}).", "*"));
Prints,
chri***********
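The generic (?<=.{N}). idea can be wrapped in a small helper. This is only a sketch: Masker, maskAfter and the visible parameter are names invented here, and it assumes the value contains no line breaks (. does not match them by default).

```java
// Hypothetical helper illustrating the generic (?<=.{N}). pattern.
class Masker {
    static String maskAfter(String value, int visible) {
        if (visible < 0) {
            throw new IllegalArgumentException("visible must be >= 0");
        }
        // Replace every character that has at least `visible` characters before it.
        return value.replaceAll("(?<=.{" + visible + "}).", "*");
    }

    public static void main(String[] args) {
        System.out.println(maskAfter("7835673653678365", 12)); // 783567365367****
        System.out.println(maskAfter("chris hemsworth", 4));   // chri***********
    }
}
```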
If you match [0-9]{12} and replace that directly with a single asterisk, you are left with accountNumber Before: *8365
There is no programming language listed, but one option to replace the digits at the end is to use a positive lookbehind to assert what is on the left are 12 digits followed by a positive lookahead to assert what is on the right are 0+ digits followed by the end of the string.
Then in the replacement use *
If the JSON value is exactly chris hemsworth or 7835673653678365, you can omit the positive lookaheads (?=\d*$) and (?=[\w ]*$), which assert the end of the string in the two expressions below.
Use the versions with the positive lookahead if the data to match is at the end of a string that contains more data, so you don't replace more matches than you expect.
(?<=[0-9]{12})(?=\d*$)\d
In Java:
(?<=[0-9]{12})(?=\\d*$)\\d
(?<=[0-9]{12}) Positive lookbehind, assert what is on the left are 12 digits
(?=\d*$) Positive lookahead, assert what is on the right are 0+ digits and assert the end of the string
\d Match a single digit
Result:
783567365367****
For the account name you can do the same with 4 word characters \w, but this will also replace the whitespace with an asterisk, because I believe you cannot skip matching that space in one regex.
(?<=[\w ]{4})(?=[\w ]*$)[\w ]
In Java
(?<=[\\w ]{4})(?=[\\w ]*$)[\\w ]
Result
chri***********
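If the space should stay visible, as in the masked example in the question (chri* *********), one possible variant is to match \S (any non-whitespace character) instead of every character after the lookbehind. This is a sketch under that assumption, with NameMasker and maskKeepingSpaces as made-up names.

```java
// Hypothetical variant: mask after the first four characters but leave whitespace alone.
class NameMasker {
    static String maskKeepingSpaces(String name) {
        // (?<=.{4}) requires four characters before the match;
        // \S skips spaces, so they are never replaced.
        return name.replaceAll("(?<=.{4})\\S", "*");
    }

    public static void main(String[] args) {
        System.out.println(maskKeepingSpaces("chris hemsworth")); // chri* *********
    }
}
```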

Scala regex match lines with special characters

I have a code segment that reads lines from a file and I want to filter certain lines out. Basically, I want to filter out every line that does not have three tab-separated columns, where the first column is a number and the other two columns can contain any character except tab and newline (DOS and Unix).
I already checked my regex on http://www.regexr.com/ and there it works.
scala> val mystr = """123456\thttp://some.url/path/to/resource\t\x03U\x1D\x1F\x04D0B0#\xA0>\xA0<\x86:http://some.url/path/to/resource\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04C0A0?\n"""
scala> val myreg = "^[0-9]+(\t[^\t\r\n]+){2}(\n|\r\n)$"
scala> mystr.matches(myreg)
res2: Boolean = false
What I found out is that the problem is related to special characters. For example a simple example:
scala> val tabstr = """123456\t123456"""
scala> val tabreg = "^[0-9]+\t[0-9]+$"
scala> tabstr.matches(tabreg)
res3: Boolean = false
scala> val tabstr = "123456\t123456"
scala> val tabreg = "^[0-9]+\t[0-9]+$"
scala> tabstr.matches(tabreg)
res4: Boolean = true
It seems I mustn't use a raw string for my line (see mystr in the first code block). But if I don't use a raw string scala complains about
error: invalid escape character
So how can I deal with this messy input and still use my regex to filter out some lines?
You are using raw string literals. Inside a raw string literal, \ does not form escape sequences such as tab \t or newline \n; the \n in a raw string literal is just two characters, a backslash followed by n.
In a regex, to match a literal \, you need to use 2 backslashes in a raw-string literal based regex, and 4 backslashes in a regular string.
So, to match all your inputs, you need to use the following regexps:
val mystr = """23456\thttp://some.url/path/to/resource\t\x03U\x1D\x1F\x04D0B0#\xA0>\xA0<\x86:http://some.url/path/to/resource\x06\x08+\x06\x01\x05\x05\x07\x01\x01\x04C0A0?\n"""
val myreg = """[0-9]+(?:\\t(?:(?!\\[trn]).)*){2}(?:\\r)?(?:\\n)"""
println(mystr.matches(myreg)) // => true
val tabstr = """123456\t123456"""
println(tabstr.matches("""[0-9]+\\t[0-9]+""")) // => true
val tabstr2 = "123456\t123456"
println(tabstr2.matches("""^[0-9]+(?:\\t|\t)[0-9]+$""")) // => true
The non-capturing groups are not important here: you only need to check whether the string matches, so capturing groups would work just as well. For the same reason, ^ and $ are not needed, since matches requires the whole input string to match. If you later need to extract matches or captured groups, non-capturing groups give you a cleaner output structure.
The last two regexps are simple: (?:\\t|\t) matches either a literal backslash followed by t, or a tab character; \t alone matches just a tab.
The first one uses a tempered greedy token. This is a simplified regex; a more efficient one can be written with the unroll-the-loop technique: [0-9]+(?:\\t[^\\]*(?:\\(?![trn])[^\\]*)*){2}(?:\\r)?(?:\\n). The tempered version breaks down as follows:
[0-9]+ - 1 or more digits
(?:\\t(?:(?!\\[trn]).)*){2} - tempered greedy token, 2 occurrences of a literal string \t followed by any characters but a newline other than 2-symbol combinations \t or \r or \n.
(?:\\r)? - 1 or 0 occurrences of \r
(?:\\n) - one occurrence of a literal combination of \ and n.
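The "2 backslashes in a raw string, 4 in a regular string" rule can be checked with a small Java sketch, since ordinary Java string literals always process escapes. BackslashDemo and its method names are invented for this illustration.

```java
// Hypothetical demo of literal backslash-t versus a real tab character.
class BackslashDemo {
    // "\\\\t" in Java source is the regex \\t: a literal backslash followed by 't'.
    static boolean matchesLiteralBackslashT(String s) {
        return s.matches("[0-9]+\\\\t[0-9]+");
    }

    // "\t" in Java source puts a real tab character into the regex.
    static boolean matchesTab(String s) {
        return s.matches("[0-9]+\t[0-9]+");
    }

    // Accepts either form, like (?:\\t|\t) in the answer above.
    static boolean matchesEither(String s) {
        return s.matches("[0-9]+(?:\\\\t|\t)[0-9]+");
    }

    public static void main(String[] args) {
        String escaped = "123456\\t123456"; // contains backslash + 't'
        String tabbed = "123456\t123456";   // contains a real tab
        System.out.println(matchesLiteralBackslashT(escaped)); // true
        System.out.println(matchesTab(escaped));               // false
        System.out.println(matchesTab(tabbed));                // true
    }
}
```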

Regex to get text between two strings

I want to extract dynamic six- or seven-digit numbers like the one shown below:
id="tid_3660328">
and append them to the end of TextBox1.
In other words, I want to get the number: 3660328
From between: id="tid_
and: ">
My question is how I could do this in VB.NET. My first thought was "regex", which is a topic I have zero experience on. I appreciate the help.
Note: I was thinking I could use the code here but with my own regex: https://stackoverflow.com/a/9332731
This is a good place for using RegEx.
If you only want to find numbers that are exactly seven digits you could use this RegEx pattern:
id="tid_(\d{7})">
Or, if you don't care how many digits it is, you could use this pattern:
id="tid_(\d+)">
Here's what the pattern means:
id="tid_ - Matching strings must begin with this text
(...) - Creates a group so that we can later access the value of just this part of the match.
\d - Any numeric digit character
{7} - Seven numeric characters in a row
"> - Matching strings must end with this text
In the second pattern, the +, which replaces the {7} just means one-or-more instead of exactly seven.
In VB.NET, you can search an input string using a RegEx pattern, like this:
' Needs this import at the top of the file:
Imports System.Text.RegularExpressions
Public Function FindNumbers(input As String) As List(Of String)
    Dim numbers As New List(Of String)()
    ' Group 1 captures only the seven digits
    Dim pattern As String = "id=""tid_(\d{7})"">"
    For Each i As Match In Regex.Matches(input, pattern)
        numbers.Add(i.Groups(1).Value)
    Next
    Return numbers
End Function
Notice that in the string literal in VB, you have to escape the quotation marks by doubling them. You'll also notice that, instead of using i.Value, we are using i.Groups(1).Value. The reason is that i.Value will equal the entire matched string (e.g. id="tid_3660328">), whereas group 1 will equal just the number part (e.g. 3660328).
Update
To answer your question below, to call this function and output it to a TextBox, you could do something like this:
Dim numbers As List(Of String) = FindNumbers("id=""tid_3660328"">")
TextBox1.Text = String.Join(Environment.NewLine, numbers.ToArray())
Consider the following Regex...
(?<=tid_).*?(?=\"\>)
Explanation:
(?<=tid_) : Match the Prefix tid_ but exclude it from the capture
.*? : Any Character, any number of repetitions, as few as possible
(?=\">) : Match the suffix "> but exclude it from the capture
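The capturing-group approach is not specific to VB.NET. As a rough Java sketch of the same idea (TidExtractor and extractIds are hypothetical names, and the \d+ variant of the pattern is assumed):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the digits between id="tid_ and ">.
class TidExtractor {
    private static final Pattern TID = Pattern.compile("id=\"tid_(\\d+)\">");

    static List<String> extractIds(String html) {
        List<String> ids = new ArrayList<>();
        Matcher m = TID.matcher(html);
        while (m.find()) {
            ids.add(m.group(1)); // group 1 is the number only, without the prefix and suffix
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(extractIds("<a id=\"tid_3660328\">link</a>")); // [3660328]
    }
}
```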

Recognize numbers in french format inside document using regex

I have a document containing numbers in various formats, french, english, custom formats.
I wanted a regex that could catch ONLY numbers in french format.
This is a complete list of numbers I want to catch (d represents a digit, decimal separator is comma , and thousands separator is space)
d,d d,dd d,ddd
dd,d dd,dd dd,ddd
ddd,d ddd,dd ddd,ddd
d ddd,d d ddd,dd d ddd,ddd
dd ddd,d dd ddd,dd dd ddd,ddd
ddd ddd,d ddd ddd,dd ddd ddd,ddd
d ddd ddd,d...
dd ddd ddd,d...
ddd ddd ddd,d...
This is the regex I have
(\d{1,3}\s(\d{3}\s)*\d{3}(\,\d{1,3})?|\d{1,3}\,\d{1,3})
It catches French formats like those above, so I am on the right track, but it also catches parts of numbers like d,ddd.dd (because it catches d,ddd) or d,ddd,ddd (because it catches d,ddd).
What should I add to my regex ?
The VBA code I have:
Sub ChangeNumberFromFRformatToENformat()
    Dim SectionText As String
    Dim RegEx As Object, RegC As Object, RegM As Object
    Dim i As Integer
    Set RegEx = CreateObject("vbscript.regexp")
    With RegEx
        .Global = True
        .MultiLine = False
        .Pattern = "(\d{1,3}\s(\d{3}\s)*\d{3}(\,\d{1,3})?|\d{1,3}\,\d{1,3})"
        ' regular expression used by the macro to recognise FR-formatted numbers
    End With
    For i = 1 To ActiveDocument.Sections.Count()
        SectionText = ActiveDocument.Sections(i).Range.Text
        If RegEx.test(SectionText) Then
            Set RegC = RegEx.Execute(SectionText)
            ' RegC is the regular expression match collection, holding French-format numbers
            For Each RegM In RegC
                Call ChangeThousandAndDecimalSeparator(RegM.Value)
            Next 'For Each RegM In RegC
            Set RegC = Nothing
            Set RegM = Nothing
        End If
    Next 'For i = 1 To ActiveDocument.Sections.Count()
    Set RegEx = Nothing
End Sub
The user stema gave me a nice solution. The regex should be:
(?<=^|\s)\d{1,3}(?:\s\d{3})*(?:\,\d{1,3})?(?=\s|$)
But VBA complains that the regexp has unescaped characters. I found one in (?: \d{3}), between (?: and \d{3}): a blank character, which I can replace with \s. I think the second one is in (?:,\d{1,3}), between ?: and \d: the comma, which becomes \, when escaped.
So the regex is now (?<=^|\s)\d{1,3}(?:\s\d{3})*(?:\,\d{1,3})?(?=\s|$) and it works fine in RegExr but my VBA code will not accept it.
NEW LINE IN POST :
I have just discovered that VBA doesn't agree with this sequence of the regex ?<=^
What about this?
\b\d{1,3}(?: \d{3})*(?:,\d{1,3})?\b
See it here on Regexr
\b are word boundaries
At first (\d{1,3}) match 1 to 3 digits, then there can be 0 or more groups of a leading space followed by 3 digits ((?: \d{3})*) and at last there can be an optional fraction part ((?:,\d{1,3})?)
Edit:
if you want to avoid 1,111.1 then the \b anchors are not good for you. Try this:
(?<=^|\s)\d{1,3}(?: \d{3})*(?:,\d{1,3})?(?=\s|$)
This regex requires now a whitespace or the start of the string before and a whitespace or the end of the string after the number to match.
Edit 2:
Since look behinds are not supported you can change to
(?:^|\s)\d{1,3}(?: \d{3})*(?:,\d{1,3})?(?=\s|$)
This changes nothing at the start of the string, but if the number is preceded by whitespace, that whitespace is now included in the match. Before the match result is used, the leading whitespace has to be stripped (I am quite sure VBA has a function for that; try Trim()).
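An alternative to trimming is to wrap the number itself in a capturing group and read that group instead of the whole match. The sketch below shows the idea in Java rather than VBA (FrenchNumbers and find are invented names). The same group-based approach should carry over to engines without lookbehind, but that is an assumption, not something tested here against VBScript.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds French-formatted numbers without using lookbehind.
class FrenchNumbers {
    // (?:^|\s) consumes the leading boundary; group 1 holds only the number;
    // (?=\s|$) requires whitespace or end of input after it.
    private static final Pattern FR_NUMBER =
            Pattern.compile("(?:^|\\s)(\\d{1,3}(?: \\d{3})*(?:,\\d{1,3})?)(?=\\s|$)");

    static List<String> find(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = FR_NUMBER.matcher(text);
        while (m.find()) {
            numbers.add(m.group(1)); // no leading whitespace to strip
        }
        return numbers;
    }

    public static void main(String[] args) {
        // English-style 1,111.1 and 1,234,567 are skipped.
        System.out.println(find("Total 1 234,56 and 1,111.1 and 12,5 and 1,234,567"));
        // [1 234,56, 12,5]
    }
}
```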
If you are reading on a line by line basis, you might consider adding anchors (^ and $) to your regex, so you will end up with something like so:
^(\d{1,3}\s(\d{3}\s)*\d{3}(\,\d{1,3})?|\d{1,3}\,\d{1,3})$
This instructs the RegEx engine to start matching from the beginning of the line till the very end.