Regex repeated capturing group captures the last iteration but I need all

Example code:
var reStr = `"(?:\\"|[^"])*"`
var reStrSum = regexp.MustCompile(`(?m)(` + reStr + `)\s*\+\s*(` + reStr + `)\s*\+\s*(` + reStr + `)`)
var str = `"This\nis\ta\\string" +
"Another\"string" +
"Third string"
`
for i, match := range reStrSum.FindAllStringSubmatch(str, -1) {
fmt.Println(match, "found at index", i)
for i, str := range match {
fmt.Println(i, str)
}
}
Output:
["This\nis\ta\\string" +
"Another\"string" +
"Third string" "This\nis\ta\\string" "Another\"string" "Third string"] found at index 0
0 "This\nis\ta\\string" +
"Another\"string" +
"Third string"
1 "This\nis\ta\\string"
2 "Another\"string"
3 "Third string"
That is, it matches the "sum of strings" and captures all three strings correctly.
My problem is that I do not want to match the sum of exactly three strings. I want to match every "sum of strings", where a sum consists of one or more string literals. I have tried to express this with {0,}:
var reStr = `"(?:\\"|[^"])*"`
var reStrSum = regexp.MustCompile(`(?m)(` + reStr + `)` + `(?:\s*\+\s*(` + reStr + `)){0,}`)
var str = `
test1("This\nis\ta\\string" +
"Another\"string" +
"Third string summed");
test2("Second string " + "sum");
`
for i, match := range reStrSum.FindAllStringSubmatch(str, -1) {
fmt.Println(match, "found at index", i)
for i, str := range match {
fmt.Println(i, str)
}
}
then I get this result:
["This\nis\ta\\string" +
"Another\"string" +
"Third string summed" "This\nis\ta\\string" "Third string summed"] found at index 0
0 "This\nis\ta\\string" +
"Another\"string" +
"Third string summed"
1 "This\nis\ta\\string"
2 "Third string summed"
["Second string " + "sum" "Second string " "sum"] found at index 1
0 "Second string " + "sum"
1 "Second string "
2 "sum"
Group 0 of the first match contains all three strings (the regexp matches correctly), but there are only two capturing groups in the expression, and the second group only contains the last iteration of the repetition. That is, "Another\"string" is lost in the process and cannot be accessed.
Would it be possible to get all iterations of (all repetitions) inside group 2 somehow?
I would also accept any workaround that uses nested loops. But please be aware that I cannot simply replace the {0,} repetition with an outer FindAllStringSubmatch call, because the FindAllStringSubmatch call is already used for iterating over "sums of strings". In other words, I must find the first string sum and also the "Second string sum".
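A nested-loop workaround can be sketched as follows. Go's regexp package keeps only the last iteration of a repeated group, so this sketch lets the outer pattern merely delimit each sum and re-scans the matched text with the single-literal pattern. The names stringSums, reSum, and reLiteral are made up for illustration; the literal pattern is the one from the question.

```go
package main

import (
	"fmt"
	"regexp"
)

// reStr is the string-literal pattern from the question.
const reStr = `"(?:\\"|[^"])*"`

var (
	reLiteral = regexp.MustCompile(reStr)
	// The outer pattern needs no capturing groups: it only delimits each sum.
	reSum = regexp.MustCompile(reStr + `(?:\s*\+\s*` + reStr + `)*`)
)

// stringSums returns, for every sum of literals in src, the list of its operands.
func stringSums(src string) [][]string {
	var sums [][]string
	for _, sum := range reSum.FindAllString(src, -1) {
		// Inner pass: re-scan only the text of this one sum.
		sums = append(sums, reLiteral.FindAllString(sum, -1))
	}
	return sums
}

func main() {
	src := `
test1("This\nis\ta\\string" +
"Another\"string" +
"Third string summed");
test2("Second string " + "sum");
`
	for i, operands := range stringSums(src) {
		fmt.Println("sum", i)
		for j, op := range operands {
			fmt.Println(" ", j, op)
		}
	}
}
```

The outer FindAllString still iterates over the sums, so both the first sum and the "Second string sum" are found; the inner call recovers every operand, including the middle ones.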

I just found a workaround that will work. I can do two passes. In the first pass, I just match all string literals, and replace them with unique placeholders in the original text. Then the transformed text won't contain any strings, and it becomes much easier to do further processing on it in a second pass.
Something like this:
type javaString struct {
value string
lineno int
}
// First we find all string literals
var placeholder = "JSTR"
var reJavaStringLiteral = regexp.MustCompile(`(?m)("(?:\\"|[^"])*")`)
javaStringLiterals := make([]javaString, 0)
// FindAllStringSubmatchIndex reports byte offsets, so a repeated literal gets its own position
// (strings.Index would always return the offset of its first occurrence).
for _, loc := range reJavaStringLiteral.FindAllStringSubmatchIndex(strContent, -1) {
head := strContent[:loc[0]]
lineno := strings.Count(head, "\n") + 1
javaStringLiterals = append(javaStringLiterals, javaString{value: strContent[loc[2]:loc[3]], lineno: lineno})
}
// Next, we replace all string literals with placeholders.
for i, jstr := range javaStringLiterals {
strContent = strings.Replace(strContent, jstr.value, fmt.Sprintf("%v(%v)", placeholder, i), 1)
}
// Now the transformed text does not contain any string literals.
After the first pass, the original text becomes:
test1(JSTR(0) +
JSTR(1) +
JSTR(2));
test2(JSTR(3) + JSTR(4));
After this step, I can easily look for "JSTR(\d+) + JSTR(\d+) + JSTR(\d+)..." expressions. Now they are easy to find, because the text does not contain any strings (that could otherwise contain practically anything and interfere with regular expressions). These "sum of string" matches can then be re-matched with another FindAllStringSubmatch (in an inner loop) and then I'll get all information that I needed.
This is not a real solution: it requires writing a lot of code, it is specific to my concrete use case, and it does not answer the original question, which is how to access all iterations of a repeated capturing group.
But the general idea of the workaround might be beneficial for somebody who is facing a similar problem.
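The second pass over the placeholder text might look like the sketch below. It assumes the placeholders have the form JSTR(n), with n being the index into the slice of collected literals; placeholderSums and the two pattern variables are hypothetical names, not part of the code above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var (
	// One or more placeholders joined by "+".
	rePlaceholderSum = regexp.MustCompile(`JSTR\(\d+\)(?:\s*\+\s*JSTR\(\d+\))*`)
	// A single placeholder; group 1 is the literal's index.
	rePlaceholder = regexp.MustCompile(`JSTR\((\d+)\)`)
)

// placeholderSums returns, for each sum of placeholders, the literal indices it contains.
func placeholderSums(text string) [][]int {
	var sums [][]int
	for _, sum := range rePlaceholderSum.FindAllString(text, -1) {
		var indices []int
		for _, m := range rePlaceholder.FindAllStringSubmatch(sum, -1) {
			n, err := strconv.Atoi(m[1])
			if err != nil {
				continue // index does not fit in an int; skipped in this sketch
			}
			indices = append(indices, n)
		}
		sums = append(sums, indices)
	}
	return sums
}

func main() {
	text := "test1(JSTR(0) +\nJSTR(1) +\nJSTR(2));\ntest2(JSTR(3) + JSTR(4));\n"
	fmt.Println(placeholderSums(text)) // [[0 1 2] [3 4]]
}
```

Each index can then be looked up in the slice of literals to recover the original value and line number.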

Related

Why is string.find_first_of behaving this way?

I am trying to make a (assembly) parser which uses a string as a guide for how to cut the text to get the tokens I want.
string s = "$t4,";
string guide = "$!,$!,$!";
int i = 1;
string test =s.substr(0, s.find_first_of(" ,.\t"+to_string(guide[i+1]) ));
cout << test << "\n";
if s = "$t4" then test = "$t"
I expect test to be "$t4". This works for every other $tX; it fails only for the number 4, even though 4 is not in the (" ,.\t"+to_string(guide[i+1])) string.
s.find_first_of(" ,.\t" + std::to_string(guide[i + 1]))
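Assuming ASCII, that string will be:
" ,.\t44"
44 is the ASCII value of the , in guide[i + 1]: std::to_string takes a number, so the char is promoted to int and formatted as decimal digits.
The first character of "$t4," that is found in that set is the 4 at position 2, so the substring starting at 0 with length 2 is $t. To append the character itself, use std::string(" ,.\t") + guide[i + 1].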
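The same pitfall can be reproduced outside C++. The sketch below is a Go analogue, not the asker's code: strconv.Itoa(int(c)) plays the role of std::to_string, string(c) appends the character itself, and cut is a hypothetical helper standing in for substr plus find_first_of.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cut returns the prefix of s that precedes the first character found in delims.
func cut(s, delims string) string {
	if i := strings.IndexAny(s, delims); i >= 0 {
		return s[:i]
	}
	return s
}

func main() {
	s := "$t4,"
	guide := "$!,$!,$!"
	i := 1

	asNumber := " ,.\t" + strconv.Itoa(int(guide[i+1])) // " ,.\t44": the digits of the character code
	asChar := " ,.\t" + string(guide[i+1])              // " ,.\t,": the character itself

	fmt.Println(cut(s, asNumber)) // $t
	fmt.Println(cut(s, asChar))   // $t4
}
```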

Using Regex To find total occurrence of &T [duplicate]

I have a string (for example: "Hello there. My name is John. I work very hard. Hello there!") and I am trying to find the number of occurrences of the string "hello there". So far, this is the code I have:
Dim input as String = "Hello there. My name is John. I work very hard. Hello there!"
Dim phrase as String = "hello there"
Dim Occurrences As Integer = 0
If input.toLower.Contains(phrase) = True Then
Occurrences = input.Split(phrase).Length
'REM: Do stuff
End If
Unfortunately, what this line of code seems to do is split the string every time it sees the first letter of phrase, in this case, h. So instead of the result Occurrences = 2 that I would hope for, I actually get a much larger number. I know that counting the number of splits in a string is a horrible way to go about doing this, even if I did get the correct answer, so could someone please help me out and provide some assistance?
Yet another idea:
Dim input As String = "Hello there. My name is John. I work very hard. Hello there!"
Dim phrase As String = "Hello there"
Dim Occurrences As Integer = (input.Length - input.Replace(phrase, String.Empty).Length) / phrase.Length
You just need to make sure that phrase.Length > 0.
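For comparison, the length-difference arithmetic can be sketched in Go. This is an illustration only, not VB.NET code; countByReplace and countFold are made-up names, and the guard covers the empty-phrase case mentioned above.

```go
package main

import (
	"fmt"
	"strings"
)

// countByReplace counts non-overlapping occurrences of phrase in s by measuring
// how much shorter s becomes once every occurrence is removed.
func countByReplace(s, phrase string) int {
	if phrase == "" {
		return 0 // guard against division by zero
	}
	return (len(s) - len(strings.ReplaceAll(s, phrase, ""))) / len(phrase)
}

// countFold does the same count while ignoring case.
func countFold(s, phrase string) int {
	return countByReplace(strings.ToLower(s), strings.ToLower(phrase))
}

func main() {
	input := "Hello there. My name is John. I work very hard. Hello there!"
	fmt.Println(countByReplace(input, "Hello there")) // 2
	fmt.Println(countByReplace(input, "hello there")) // 0
	fmt.Println(countFold(input, "hello there"))      // 2
}
```

The middle call shows why the phrase in this answer is spelled "Hello there": the replacement is case-sensitive.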
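The best way to do it is this (Regex.Escape keeps characters such as . or ? in the search string from being treated as regex syntax):
Public Function countString(ByVal inputString As String, ByVal stringToBeSearchedInsideTheInputString as String) As Integer
Return System.Text.RegularExpressions.Regex.Split(inputString, System.Text.RegularExpressions.Regex.Escape(stringToBeSearchedInsideTheInputString)).Length -1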
End Function
str="Thisissumlivinginsumgjhvgsum in the sum bcoz sum ot ih sum"
b= LCase(str)
array1=Split(b,"sum")
l=Ubound(array1)
msgbox l
The output gives you the number of occurrences of a string within another one.
You can create a Do Until loop that stops once an integer cursor variable equals the length of the string you're checking. If the phrase is found, increment your occurrence count and add the position at which it was found plus the length of the phrase to the cursor. If the phrase cannot be found, you are done searching (no more results), so set the cursor to the length of the target string. To avoid counting the same occurrence more than once, search only from the cursor to the end of the target string in each iteration (strCheckThisString).
Dim input As String = "hello there. this is a test. hello there hello there!"
Dim phrase As String = "hello there"
Dim Occurrences As Integer = 0
Dim intCursor As Integer = 0
Do Until intCursor >= input.Length
Dim strCheckThisString As String = Mid(LCase(input), intCursor + 1, (Len(input) - intCursor))
Dim intPlaceOfPhrase As Integer = InStr(strCheckThisString, phrase)
If intPlaceOfPhrase > 0 Then
Occurrences += 1
intCursor += (intPlaceOfPhrase + Len(phrase) - 1)
Else
intCursor = input.Length
End If
Loop
You just have to change the input of the Split function into a string array and then declare the StringSplitOptions.
Try out this line of code:
Occurrences = input.Split({phrase}, StringSplitOptions.None).Length
I haven't checked this, but I'm thinking you'll also have to account for the fact that occurrences would be too high due to the fact that you're splitting using your string and not actually counting how many times it is in the string, so I think Occurrences = Occurrences - 1
Hope this helps
You could create a recursive function using IndexOf. Passing the string to be searched and the string to locate, each recursion increments a Counter and sets the StartIndex to +1 the last found index, until the search string is no longer found. Function will require optional parameters Starting Position and Counter passed by reference:
Function InStrCount(ByVal SourceString As String, _
ByVal SearchString As String, _
Optional ByRef StartPos As Integer = 0, _
Optional ByRef Count As Integer = 0) As Integer
If SourceString.IndexOf(SearchString, StartPos) > -1 Then
Count += 1
InStrCount(SourceString, _
SearchString, _
SourceString.IndexOf(SearchString, StartPos) + 1, _
Count)
End If
Return Count
End Function
Call function by passing string to search and string to locate and, optionally, start position:
Dim input As String = "Hello there. My name is John. I work very hard. Hello there!"
Dim phrase As String = "hello there"
Dim Occurrences As Integer
Occurrences = InStrCount(input.ToLower, phrase.ToLower)
Note the use of .ToLower, which makes the comparison ignore case. Omit it if you want the comparison to be case-sensitive.
One more solution based on InStr(i, str, substr) function (searching substr in str starting from i position, more info about InStr()):
Function findOccurancesCount(baseString, subString)
occurancesCount = 0
i = 1
Do
foundPosition = InStr(i, baseString, subString) 'searching from i position
If foundPosition > 0 Then 'substring is found at foundPosition index
occurancesCount = occurancesCount + 1 'count this occurance
i = foundPosition + 1 'searching from i+1 on the next cycle
End If
Loop While foundPosition <> 0
findOccurancesCount = occurancesCount
End Function
As soon as no substring is found (InStr returns 0 instead of the position of the substring in the base string), the search is over and the occurrence count is returned.
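Because the loop above resumes one position after each hit, it also counts overlapping occurrences, whereas the Split- and Replace-based answers count non-overlapping ones. A small Go sketch makes the difference visible; countOccurrences and its overlapping flag are hypothetical and written only for this comparison.

```go
package main

import (
	"fmt"
	"strings"
)

// countOccurrences counts how often sub appears in s. With overlapping set, the
// search resumes one byte after each hit; otherwise it resumes after the whole match.
func countOccurrences(s, sub string, overlapping bool) int {
	if sub == "" {
		return 0
	}
	step := len(sub)
	if overlapping {
		step = 1
	}
	count := 0
	for i := 0; ; {
		j := strings.Index(s[i:], sub)
		if j < 0 {
			return count
		}
		count++
		i += j + step
	}
}

func main() {
	fmt.Println(countOccurrences("aaaa", "aa", true))  // 3
	fmt.Println(countOccurrences("aaaa", "aa", false)) // 2
}
```

For a phrase like "hello there" the two modes agree, since the phrase cannot overlap with itself.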
Looking at your original attempt, I have found that this should do the trick as "Split" creates an array.
Occurrences = input.split(phrase).ubound
This is CaSe sensitive, so in your case the phrase should equal "Hello there", as there is no "hello there" in the input
Expanding on Sumit Kumar's simple solution, here it is as a one-line working function:
Public Function fnStrCnt(ByVal str As String, ByVal substr As String) As Integer
fnStrCnt = UBound(Split(LCase(str), substr))
End Function
Demo:
Sub testit()
Dim thePhrase
thePhrase = "Once upon a midnight dreary while a man was in a house in the usa."
If fnStrCnt(thePhrase, " a ") > 1 Then
MsgBox "Found " & fnStrCnt(thePhrase, " a ") & " occurrences."
End If
End Sub 'testit()
I don't know if this is more obvious?
Starting from the beginning of longString, compare the next characters, up to the number of characters in phrase, with phrase. If phrase is not found, start looking from the second character, and so on. If it is found, start again from the current position plus the number of characters in phrase and increment the value of occurences.
Module Module1
Sub Main()
Dim longString As String = "Hello there. My name is John. I work very hard. Hello there! Hello therehello there"
Dim phrase As String = "hello There"
Dim occurences As Integer = 0
Dim n As Integer = 0
Do Until n >= longString.Length - (phrase.Length - 1)
If longString.ToLower.Substring(n, phrase.Length).Contains(phrase.ToLower) Then
occurences += 1
n = n + (phrase.Length - 1)
End If
n += 1
Loop
Console.WriteLine(occurences)
End Sub
End Module
I used this in VBScript; you can convert it to VB.NET as well.
Dim str, strToFind
str = "sdfsdf:sdsdgs::"
strToFind = ":"
MsgBox GetNoOfOccurranceOf( strToFind, str)
Function GetNoOfOccurranceOf(ByVal subStringToFind As String, ByVal strReference As String)
Dim iTotalLength, newString, iTotalOccCount
iTotalLength = Len(strReference)
newString = Replace(strReference, subStringToFind, "")
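' Divide by the length of the search string so that multi-character strings are counted correctly
iTotalOccCount = (iTotalLength - Len(newString)) / Len(subStringToFind)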
GetNoOfOccurranceOf = iTotalOccCount
End Function
I know this thread is really old, but I got another solution too:
Function countOccurencesOf(needle As String, s As String)
Dim count As Integer = 0
For i As Integer = 0 to s.Length - 1
If s.Substring(i).Startswith(needle) Then
count = count + 1
End If
Next
Return count
End Function

regular expression to match a ascii character

I want to match a regular expression for the string
2=abc\u000148=123\u0001
Explanation
Key-value pairs are separated by the SOH (\u0001) character
Key - a number
Value - a string of digits, letters, and decimals
Key and value are separated by "="
The regex I tried is
[0-9]=.*[u0001]+
but it does not match properly
Update
I have a list of numbers: val num = Seq(2,3,4)
Instead of finding the matches, I want to remove them from the string.
The keys to remove are the values in the list num.
Input
2=abc\u000148=123\u00013=def\u0001
Output (the filtered string)
148=123\u0001, where the pairs whose keys match 2 and 3 have been removed
object Main extends App {
val s = "2=abc\u000148=123\u00013=def\u0001"
val num = Seq(2,3)
for (e <- num) {
val p = s"(\\$e+)=([^\u0001]*)".r
test(p)
}
private def test(p: Regex) = {
p.findAllIn(s).matchData foreach {
m => println(m.group(1) + " : " + m.group(2))
}
}
}
You need to build the pattern dynamically like this:
s"\\b(?:${num.mkString("|")})=[^\\u0001]*\\u0001*"
Details
\b - a word boundary
(?:num1|num2...|numN) - any of the values in the num variable
= - an equal sign
[^\u0001]* - zero or more chars other than a SOH char (a char with the decimal code of 1)
\u0001* - zero or more SOH chars.
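The same pattern-building idea, sketched in Go for comparison. This is not a translation of the Scala answer's code: removeKeys is a made-up helper, and \x01 is used because Go's regexp syntax has no \u escape.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// removeKeys deletes every "key=value<SOH>" pair whose key is listed in keys.
func removeKeys(s string, keys []int) string {
	if len(keys) == 0 {
		return s
	}
	alts := make([]string, len(keys))
	for i, k := range keys {
		alts[i] = strconv.Itoa(k)
	}
	// \b stops a key such as 2 from matching inside 22; \x01 is the SOH character.
	re := regexp.MustCompile(`\b(?:` + strings.Join(alts, "|") + `)=[^\x01]*\x01*`)
	return re.ReplaceAllString(s, "")
}

func main() {
	s := "1041=pqr\x0148=xyz\x0122=8\x012=abc\x0148=123\x013=def\x01"
	fmt.Printf("%q\n", removeKeys(s, []int{2, 3}))
}
```

The pairs with keys 1041, 48, and 22 survive because the word boundary requires the whole key to match.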
See a Scala demo:
val num = Seq(2,3)
val s = "1041=pqr\u000148=xyz\u000122=8\u00012=abc\u000148=123\u00013=def\u0001"
val pattern = s"\\b(?:${num.mkString("|")})=[^\\u0001]*\\u0001*"
// println(pattern) // => \b(?:2|3)=[^\u0001]*\u0001*
println(s.replaceAll(pattern, ""))
// => 1041=pqr\u000148=xyz\u000122=8\u000148=123\u0001

Getting the index of a slice

I want to do some processing on a string in Scala. The first stage of that is finding the index of articles such as: "A ", " A ", "a ", " a ". I am trying to do that like this:
"A house is in front of us".indexOfSlice("\\s+[Aa] ")
I think this should return 0, as the substring is first matched in the first position of the string.
However, this returns -1.
Why does it return -1? Is the regex I am using incorrect?
The other answers as I type this are just missing the point. Your problem is that indexOfSlice doesn't take a regexp, but a sub-sequence to search for in the sequence. So fixing the regexp won't help at all.
Try this:
val pattern = "\\b[Aa]\\b".r.unanchored
for (mo <- pattern.findAllMatchIn("A house is in front of us, a house is in front of us all")) {
println("pattern starts at " + mo.start)
}
//> pattern starts at 0
//| pattern starts at 27
(with fixed regex, too)
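The distinction between a literal search and a regex search is not specific to Scala. As a sketch in Go (articleStarts and reArticle are made-up names), strings.Index treats its argument as plain text, while the regexp package interprets it as a pattern:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var reArticle = regexp.MustCompile(`\b[Aa]\b`)

// articleStarts returns the byte offset of every standalone "A" or "a" in s.
func articleStarts(s string) []int {
	var starts []int
	for _, loc := range reArticle.FindAllStringIndex(s, -1) {
		starts = append(starts, loc[0])
	}
	return starts
}

func main() {
	s := "A house is in front of us, a house is in front of us all"
	// Literal search: the pattern text itself does not occur in s.
	fmt.Println(strings.Index(s, `\b[Aa]\b`)) // -1
	// Regex search: the pattern is interpreted.
	fmt.Println(articleStarts(s)) // [0 27]
}
```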
Edit: counter-example for the popular but wrong suggestion of "\\s*[Aa] "
val pattern2 = "\\s*[Aa] ".r.unanchored
for (mo <- pattern2.findAllMatchIn("The agenda is hidden")) {
println("pattern starts at " + mo.start)
}
//> pattern starts at 9
I see a mistake in your regex. Your regex is searching for
at least one space (\s+)
a letter (either A or a)
but the string you are matching doesn't contain a space at the beginning. That's why it returns -1 instead of index 0.
You could write your regex as "^\\s*[Aa] "
Here is an example:
val text = "A house is in front of us";
val matcher = Pattern.compile("^\\s*[Aa] ").matcher(text)
var idx = 0;
if(matcher.find()){
idx = matcher.start()
}
println(idx)
It should print 0, as expected.

Excel - Extract all occurrences of a String Pattern + the subsequent 4 characters after the pattern match from a cell

I am struggling with a huge Excel sheet where I need to extract from a certain cell (A1),
all occurrences of a string pattern e.g. "TCS" + the following 4 characters after the pattern match e.g. TCS1234 comma-separated into another cell (B1).
Example:
Cell A1 contains the following string:
HRS164, SRS3439(s), SRS3440(s), SRS3441(s), SRS3442(s), SRS3443(s), SRS3444(s), SRS3445(s), SRS3449(s), SRS3450(s), SRS3451(s), SRS3452(s), SYSBASE.SSS300(s), TCS3715(s), TCS3716(s), TCS3717(s), TCS4037(s), TCS1234
All TCS-Numbers shall be comma-separated in B1:
TCS3715, TCS3716, TCS3717, TCS4037, TCS1234
It is not necessary to also extract the followed "(s)".
Could someone please help me (excel rookie) with this challenge?
TIA Erika
Here is what I would use for something like that: also a user defined function:
Function GetTCS(TheString)
For Each TItem In Split(TheString, ", ")
If Left(TItem, 3) = "TCS" Then GetTCS = GetTCS & TItem & " "
Next
GetTCS = Replace(Trim(GetTCS), " ", ", ")
End Function
This returns "TCS3715(s), TCS3716(s), TCS3717(s), TCS4037(s), TCS1234" out of your string. If you don't know how to create a user defined function, just ask; it's pretty straightforward and I'd be happy to show you. Hope this helps.
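If the trailing "(s)" should be dropped, the rule from the question (the pattern plus the next 4 characters) can be expressed as a regular expression. The sketch below shows the logic in Go rather than VBA, purely as an illustration; extract is a hypothetical helper and is not something Excel provides.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// extract returns every occurrence of prefix followed by exactly four more
// characters, joined with ", ".
func extract(cell, prefix string) string {
	// QuoteMeta keeps characters in prefix from being read as regex syntax.
	re := regexp.MustCompile(regexp.QuoteMeta(prefix) + `.{4}`)
	return strings.Join(re.FindAllString(cell, -1), ", ")
}

func main() {
	a1 := "HRS164, SRS3439(s), SYSBASE.SSS300(s), TCS3715(s), TCS3716(s), TCS3717(s), TCS4037(s), TCS1234"
	fmt.Println(extract(a1, "TCS")) // TCS3715, TCS3716, TCS3717, TCS4037, TCS1234
}
```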
Try the following User Defined Function:
Public Function Xtract(r As Range) As String
Dim s As String, L As Long, U As Long
Dim msg As String, i As Long
s = Replace(r(1).Text, " ", "")
ary = Split(s, ",")
L = LBound(ary)
U = UBound(ary)
Xtract = ""
msg = ""
For i = L To U
If Left(ary(i), 3) = "TCS" Then
If msg = "" Then
msg = Left(ary(i), 7)
Else
msg = msg & "," & Left(ary(i), 7)
End If
End If
Next i
Xtract = msg
End Function
If the TCS-parts are always at the end of the string as in your example, I would use (in B1):
=REPLACE(A1,1,FIND("TCS",A1)-1,"")