Overlapping matches in Regex - Scala - regex

I'm trying to extract all posible combinations of 3 letters from a String following the pattern XYX.
val text = "abaca dedfd ghgig"
val p = """([a-z])(?!\1)[a-z]\1""".r
p.findAllIn(text).toArray
When I run the script I get:
aba, ded, ghg
And it should be:
aba, aca, ded, dfd, ghg, gig
It does not detect overlapped combinations.

The way consists to enclose the whole pattern in a lookahead to consume only the start position:
val p = """(?=(([a-z])(?!\2)[a-z]\2))""".r
p.findAllIn(text).matchData foreach {
m => println(m.group(1))
}
The lookahead is only an assertion (a test) for the current position and the pattern inside doesn't consume characters. The result you are looking for is in the first capture group (that is needed to get the result since the whole match is empty).

You need to capture the whole pattern and put it inside a positive lookahead. The code in Scala will be the following:
object Main extends App {
val text = "abaca dedfd ghgig"
val p = """(?=(([a-z])(?!\2)[a-z]\2))""".r
val allMatches = p.findAllMatchIn(text).map(_.group(1))
println(allMatches.mkString(", "))
// => aba, aca, ded, dfd, ghg, gig
}
See the online Scala demo
Note that the backreference will turn to \2 as the group to check will have ID = 2 and Group 1 will contain the value you need to collect.

Related

How to capture multiple number ranges (e.g. 1-6,25-41,55-70) with each range in its own group?

At the moment I have an Excel sheet with a column holding data in this form:
E 1-6,44-80
E 10-76
E 44-80,233-425
E 19-55,62-83,86-119,200-390
...
I need to be able to capture each range of numbers individually. For example, I would like the first line above to result in "1-6" and "44-80" being captured into their own groups.
So, essentially I need to capture a repeating group.
When trying to use this pattern, which uses the general form for capturing repeating groups given by #ssent1 on this question:
E\s(([0-9]{1,4})-([0-9]{1,4}))((?:,([0-9]{1,4})-([0-9]{1,4}))*)
I end up only matching the first and last number ranges. I understand that this is because I'm repeating captured groups rather than capturing a repeating group, but I can't figure out how to correct my pattern. Any help would be greatly appreciated.
In Java you can make use of a capture group and the \G anchor to get continuous matches:
(?:^E\h+|\G(?!^),?(\d{1,4}-\d{1,4}))
Regex demo | Java demo
Example
String regex = "(?:^E\\h+|\\G(?!^),?(\\d{1,4}-\\d{1,4}))";
String string = "E 1-6,44-80\n"
+ "E 10-76\n"
+ "E 44-80,233-425\n"
+ "E 19-55,62-83,86-119,200-390\n"
+ "200-390";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
if (matcher.group(1) != null) {
System.out.println(matcher.group(1));
}
}
Output
1-6
44-80
10-76
44-80
233-425
19-55
62-83
86-119
200-390

RegEx repeating subcapture returns only last match

I have a regexp that tries to find multiple sub captures within matches:
https://regex101.com/r/nk1Q5J/2/
=\s?.*(?:((?i)Fields\("(\w+)"\)\.Value\)*?))
I've had simpler versions with equivalent results but this is the last iteration.
the Trick here is that the first group looks for a sequence that begins with '=' (to identify database field reads in VB.Net)
The single sub capture cases work:
Match 1. [comparison + single parameter read call]
= False And IsNull(Fields("lmpRoundedEndTime").Value)
=> G2: lmpRoundedEndTime
Match 3. [read into string] oRs.Open "select lmpEmployeeID,lmpShiftID,lmpTimecardDate,lmpProjectID from Timecards where lmpTimecardID = " + App.Convert.NumberToSql(Fields("lmlTimecardID").Value),Connection,adOpenStatic,adLockReadOnly,adCmdText
=> G2: lmlTimecardID
Match 4. [assignment] Fields("lmlEmployeeID").Value = oRs.Fields("lmpEmployeeID").Value
Where I am failing is a match with multiple sub-captures. My regexp returns the last (intended) sub capture :
Match 2. [read multiple input parameters] Fields("lmpPayrollHours").Value = App.Ax("TimecardFunctions").CalculateHours(Fields("lmpShiftID").Value,Fields("lmpRoundedStartTime").Value,Fields("lmpRoundedEndTime").Value)
=> G2: lmpRoundedEndTime
'''''''''''' ^ must capture: lmpShiftID , lmpRoundedStartTime , lmpRoundedEndTime
I've read up on lazy quantifiers etc, but can't wrap my head around where this goes wrong.
References:
https://www.regular-expressions.info/refrepeat.html
http://www.rexegg.com/regex-quantifiers.html
Related:
Regular expression: repeating groups only getting last group
What's the difference between "groups" and "captures" in .NET regular expressions?
BTW, I could quantify the sub capture as {1,5} safely for efficiency, but that's not the focus.
EDIT:
By using negative lookahead to exclude the left side of comparisons, this got me much closer (match 2 above now works):
(?:Fields\("(\w+)"\)\.Value)(?!\)?\s{0,2}[=|<])
but in the following block of code, only the first two are captured:
If oRs.EOF = False Then
If CInt(Fields("lmlTimecardType").Value) = 1 Then
If Trim(oRs.Fields("lmeDirectExpenseID").Value) <> "" Then
Fields("lmlExpenseID").Value = oRs.Fields("lmeDirectExpenseID").Value
End If
Else
If Trim(oRs.Fields("lmeIndirectExpenseID").Value) <> "" Then
Fields("lmlExpenseID").Value = oRs.Fields("lmeIndirectExpenseID").Value
End If
End If
If CInt(Fields("lmlTimecardType").Value) = 2 Then
If Trim(oRs.FIelds("lmeDefaultWorkCenterID").Value) <> "" Then
Fields("lmlWorkCenterID").Value = oRs.FIelds("lmeDefaultWorkCenterID").Value
End If
End If
End If
Capture1:
Fields("lmlExpenseID").Value = oRs.Fields("lmeDirectExpenseID").Value
Capture2:
Fields("lmlExpenseID").Value = oRs.Fields("lmeIndirectExpenseID").Value
Capture3 (failed):
Fields("lmlWorkCenterID").Value = oRs.FIelds("lmeDefaultWorkCenterID").Value
Actually, (?:Fields\("(\w+)"\)\.Value)(?!\)?\s{0,2}[=|<]) does work in my Excel sheet, just not in the regex101 test. Probably a slightly different standard used there.
https://regex101.com/r/p6zFqy/1/

Regex to match a Number between two strings [duplicate]

I have a string that looks like the following:
<#399969178745962506> hello to <#!104729417217032192>
I have a dictionary containing both that looks like following:
{"399969178745962506", "One"},
{"104729417217032192", "Two"}
My goal here is to replace the <#399969178745962506> into the value of that number key, which in this case would be One
Regex.Replace(arg.Content, "(?<=<)(.*?)(?=>)", m => userDic.ContainsKey(m.Value) ? userDic[m.Value] : m.Value);
My current regex is as following: (?<=<)(.*?)(?=>) which only matches everything in between < and > which would in this case leave both #399969178745962506 and #!104729417217032192
I can't just ignore the # sign, because the ! sign is not there every time. So it could be optimal to only get numbers with something like \d+
I need to figure out how to only get the numbers between < and > but I can't for the life of me figure out how.
Very grateful for any help!
In C#, you may use 2 approaches: a lookaround based on (since lookbehind patterns can be variable width) and a capturing group approach.
Lookaround based approach
The pattern that will easily help you get the digits in the right context is
(?<=<#!?)\d+(?=>)
See the regex demo
The (?<=<#!?) is a positive lookbehind that requires <= or <=! immediately to the left of the current location and (?=>) is a positive lookahead that requires > char immediately to the right of the current location.
Capturing approach
You may use the following pattern that will capture the digits inside the expected <...> substrings:
<#!?(\d+)>
Details
<# - a literal <# substring
!? - an optional exclamation sign
(\d+) - capturing group 1 that matches one or more digits
> - a literal > sign.
Note that the values you need can be accessed via match.Groups[1].Value as shown in the snippet above.
Usage:
var userDic = new Dictionary<string, string> {
{"399969178745962506", "One"},
{"104729417217032192", "Two"}
};
var p = #"<#!?(\d+)>";
var s = "<#399969178745962506> hello to <#!104729417217032192>";
Console.WriteLine(
Regex.Replace(s, p, m => userDic.ContainsKey(m.Groups[1].Value) ?
userDic[m.Groups[1].Value] : m.Value
)
); // => One hello to Two
// Or, if you need to keep <#, <#! and >
Console.WriteLine(
Regex.Replace(s, #"(<#!?)(\d+)>", m => userDic.ContainsKey(m.Groups[2].Value) ?
$"{m.Groups[1].Value}{userDic[m.Groups[2].Value]}>" : m.Value
)
); // => <#One> hello to <#!Two>
See the C# demo.
To extract just the numbers from you're given format, use this regex pattern:
(?<=<#|<#!)(\d+)(?=>)
See it work in action: https://regexr.com/3j6ia
You can use non-capturing groups to exclude parts of the needed pattern to be inside the group:
(?<=<)(?:#?!?)(.*?)(?=>)
alternativly you could name the inner group and use the named group to get it:
(?<=<)(?:#?!?)(?<yourgroupname>.*?)(?=>)
Access it via m.Groups["yourgroupname"].Value - more see f.e. How do I access named capturing groups in a .NET Regex?
Regex: (?:<#!?(\d+)>)
Details:
(?:) Non-capturing group
<# matches the characters <# literally
? Matches between zero and one times
(\d+) 1st Capturing Group \d+ matches a digit (equal to [0-9])
Regex demo
string text = "<#399969178745962506> hello to <#!104729417217032192>";
Dictionary<string, string> list = new Dictionary<string, string>() { { "399969178745962506", "One" }, { "104729417217032192", "Two" } };
text = Regex.Replace(text, #"(?:<#!?(\d+)>)", m => list.ContainsKey(m.Groups[1].Value) ? list[m.Groups[1].Value] : m.Value);
Console.WriteLine(text); \\ One hello to Two
Console.ReadLine();

Regex - capture multiple groups and combine them multiple times in one string

I need to combine some text using regex, but I'm having a bit of trouble when trying to capture and substitute my string. For example - I need to capture digits from the start, and add them in a substitution to every section closed between ||
I have:
||10||a||ab||abc||
I want:
||10||a10||ab10||abc10||
So I need '10' in capture group 1 and 'a|ab|abc' in capture group 2
I've tried something like this, but it doesn't work for me (captures only one [a-z] group)
(?=.*\|\|(\d+)\|\|)(?=.*\b([a-z]+\b))
I would achieve this without a complex regular expression. For example, you could do this:
input = "||10||a||ab||abc||"
parts = input.scan(/\w+/) # => ["10", "a", "ab", "abc"]
parts[1..-1].each { |part| part << parts[0] } # => ["a10", "ab10", "abc10"]
"||#{parts.join('||')}||"
str = "||10||a||ab||abc||"
first = nil
str.gsub(/(?<=\|\|)[^\|]+/) { |s| first.nil? ? (first = s) : s + first }
#=> "||10||a10||ab10||abc10||"
The regular expression reads, "match one or more characters in a pipe immediately following two pipes" ((?<=\|\|) being a positive lookbehind).

How to find the whole word in kotlin using regex

I want to find the whole word in string. But I don't know how to find the all word in kotlin. My finding word is [non alpha]cba[non alpha]. My example code is bellows.
val testLink3 = """cba#cba cba"""
val word = "cba"
val matcher = "\\b[^a-zA-Z]*(?i)$word[^a-zA-Z]*\\b".toRegex()
val ret = matcher.find(testLink3)?.groupValues
But output of my source code is "cba"
My expected value is string array such as "{cba, cba, cba}".
How to find this value in kotlin language.
You may use
val testLink3 = """cBa#Cba cbA123"""
val word = "cba"
val matcher = "(?i)(?<!\\p{L})$word(?!\\p{L})".toRegex()
println(matcher.findAll(testLink3).map{it.value}.toList() )
println(matcher.findAll(testLink3).count() )
// => [cBa, Cba, cbA]
// => 3
See the online Kotlin demo.
To obtain the list of matches, you need to run the findAll method on the regex object, map the results to values and cast to list:
.findAll(testLink3).map{it.value}.toList()
To count the matches, you may use
matcher.findAll(testLink3).count()
Regex demo
(?i) - case insensitive modifier
(?<!\\p{L}) - a negative lookbehind that fails the match if, immediately to the left of the current location, there is a letter
$word - your word variable value
(?!\\p{L}) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a letter.