Extract matched strings in C++ with regex


I have the following test strings:
#5=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$);
#6=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$);
#7=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$);
I need to extract:
the "#integer" token (it always starts at the beginning of the string) from each of the strings above, and store it in a variable;
the string between "(" and ")" in each test string.
Can someone please suggest how I can achieve this in C++ with regex?
I tried the following as a simple example (it is the body of a loop that processes one line at a time):
std::regex e ("\#[:d:]+");
if (std::regex_match(sLine,e)){
//store it and process it
}
The output should be:
#5
and
'xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$ ?? (not sure)

Description
This expression will:
capture the initial # and integer
capture the value between the parentheses
^(\#\d+).*?\(([^)]*)\)
Example
Sample Text
#5=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$);
#6=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$);
#7=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$);
Capture Groups
Group 0 gets the entire matched string
Group 1 gets the # and integer
Group 2 gets the value between the parentheses
[0][0] = #5=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$)
[0][1] = #5
[0][2] = 'xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$
[1][0] = #6=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$)
[1][1] = #6
[1][2] = 'xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$
[2][0] = #7=BUILDING('xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$)
[2][1] = #7
[2][2] = 'xxxcdccx',#5,$,$,$,#21,$,$,.ELEMENT.,$,$,$

Related

How do I replace the nth occurrence of a special character, say, a pipe delimiter with another in Scala?

I'm new to Spark using Scala and I need to replace every nth occurrence of the delimiter with the newline character.
So far, I have been successful at entering a new line after the pipe delimiter.
I'm unable to replace the delimiter itself.
My input string is
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
println(txt.replaceAll(".\\|", "$0\n"))
The above statement generates the following output.
January|
February|
March|
April|
May|
June|
July|
August|
September|
October|
November|
December
I referred to the suggestion at https://salesforce.stackexchange.com/questions/189923/adding-comma-separator-for-every-nth-character but when I put a number in the curly braces, the newline just ends up 2 characters after the delimiter.
I'm expecting my output to be as given below.
January|February
March|April
May|June
July|August
September|October
November|December
How do I change my regular expression to get the desired output?
Update:
My friend suggested I try the following statement
println(txt.replaceAll("(.*?\\|){2}", "$0\n"))
and this produced the following output
January|February|
March|April|
May|June|
July|August|
September|October|
November|December
Now I just need to get rid of the pipe symbol at the end of each line.

You want to move the 2nd bar | outside of the capture group.
txt.replaceAll("([^|]+\\|[^|]+)\\|", "$1\n")
//val res0: String =
// January|February
// March|April
// May|June
// July|August
// September|October
// November|December
Regex Explained (regex is not Scala)
( - start a capture group
[^|] - any character as long as it's not the bar | character
[^|]+ - 1 or more of those (any) non-bar chars
\\| - followed by a single bar char |
[^|]+ - followed by 1 or more of any non-bar chars
) - close the capture group
\\| - followed by a single bar char (not in capture group)
"$1\n" - replace the entire matching string with just the first $1 capture group ($0 is the entire matching string) followed by the newline char
UPDATE
For the general case of N repetitions, regex becomes a bit more cumbersome, at least if you're trying to do it with a single regex formula.
The simplest thing to do (not the most efficient but simple to code) is to traverse the String twice.
val n = 5
txt.replaceAll(s"(\\w+\\|){$n}", "$0\n")
.replaceAll("\\|\n", "\n")
//val res0: String =
// January|February|March|April|May
// June|July|August|September|October
// November|December

You could first split the string on '|' to get an array of strings, then loop through it two items at a time and join each pair to get the required output.
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
val out = txt.split("\\|") // split on the literal pipe character
// Walk the array in steps of two; slice also keeps a trailing unpaired item.
val lines = for (i <- 0 until out.length by 2) yield out.slice(i, i + 2).mkString("|")
val finalout = lines.mkString("\n") // no trailing newline after the last pair
println(finalout)

Filter a string using regular expression

I tried the following code. However, the result is not what I want.
$strLine = "100.11 Q9"
$sortString = StringRegExp ($strLine,'([0-9\.]{1,7})', $STR_REGEXPARRAYMATCH)
MsgBox(0, "", $sortString[0],2)
The output shows 100.11, but I want 100.11 9. How could I display it this way using a regular expression?

$sPattern = "([0-9\.]+)\sQ(\d+)"
$strLine = "100.11 Q9"
$sortString = StringRegExpReplace($strLine, $sPattern, '\1 \2')
MsgBox(0, "$sortString", $sortString, 2)
$strLine = "100.11 Q9"
$sortString = StringRegExp($strLine, $sPattern, 3); array of global matches.
For $i1 = 0 To UBound($sortString) -1
MsgBox(0, "$sortString[" & $i1 & "]", $sortString[$i1], 2)
Next
The pattern captures two groups, 100.11 and 9.
The first group matches any run of digits and dots up to the \s, which matches the space. The pattern then matches the literal Q, and the second group matches the digits that follow.
StringRegExpReplace replaces the whole string with the first and second groups separated by a space.
StringRegExp returns the two groups as two array elements.
Choose whichever of the two approaches you prefer.

Extract variables from string with Regex

I'm trying to extract from a string variables with the following format: ${var}
Given this string:
val s = "This is a string with ${var1} and ${var2} and {var3}"
The result should be
List("var1","var2")
This is the attempt, it ends in an exception. What's wrong with this regex?
val pattern = """\${([^\s}]+)(?=})""".r
val s = "This is a string with ${var1} and ${var2} and {var3}"
val vals = pattern.findAllIn(s)
println(vals.toList)
and the exception:
Exception in thread "main" java.util.regex.PatternSyntaxException:
Illegal repetition near index 1 \${([^\s}]+)(?=})

Note: { has a special meaning in regex: it opens a repetition quantifier. For example, a{2,10} matches a between 2 and 10 times. So you need to escape the {.
Solution 1
val pattern = """\$\{([^\s}]+)(?=})""".r
You need to read the first capturing group of each match to get the variable name, and then collect the results into a list.
Solution 2
You can also use lookbehind like
val pattern = """(?<=\$\{)[^\s}]+(?=})""".r

Matching specific lengths with regexp in Matlab

A string matching question in Matlab.
If I have a string
a = ['thehe'];
str = {'the','he'};
match = regexp(a,str);
The output is match =
[1] [1x2 double]
because it found 'he' twice and 'the' once.
How can I make it scan my string a from left to right and
match 'the' only once and 'he' only once?

To answer the explicit question, from the documentation for regexp you can specify the once search option:
a = 'thehe';
str = {'the','he'};
match = regexp(a,str, 'once');
Which returns:
match =
[1] [2]
Where match is a 1x2 cell array whose cell value(s) correspond to the first index of the match in a for each cell of str.

From the somewhat ambiguous description, I understand that you want the indices of the non-overlapping occurrences of 'the' and 'he', that is, 1 and 4.
a = ['thehe'];
str = {'the';'[^t]he'};
match = regexp(a,str)
After this, print the two results:
a(match{1}:match{1}+2)
ans =
the
and
a(match{2}+1:match{2}+2)
ans =
he
There is no third occurrence:
a(match{3})
??? Index exceeds matrix dimensions.

Regex with all components optional: how to avoid empty matches

I have to process a comma-separated string which contains triplets of values and translate them to runtime types. The input looks like:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each substring should be transformed this way:
"1x2y3z" should become Vector3 with x = 1, y = 2, z = 3
"80r160g255b" should become Color with r = 80, g = 160, b = 255
"48h30m50s" should become Time with h = 48, m = 30, s = 50
The problem I'm facing is that all the components are optional (but they preserve order) so the following strings are also valid Vector3, Color and Time values:
"1x3z" Vector3 x = 1, y = 0, z = 3
"255b" Color r = 0, g = 0, b = 255
"1h" Time h = 1, m = 0, s = 0
What have I tried so far?
All components optional
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
The A, B and C are replaced with the correct letter for each case. The expression almost works, but it gives twice the expected results (one match for the string and another match for an empty string just after the first match), for example:
"1h1m1s" two matches [1]: "1h1m1s" [2]: ""
"11x50z" two matches [1]: "11x50z" [2]: ""
"11111h" two matches [1]: "11111h" [2]: ""
This isn't unexpected... after all an empty string matches the expression when ALL of the components are empty; so in order to fix this issue I've tried the following:
1 to 3 quantifier
((?:\d+[ABC]){1,3})
But now the expression matches strings with the wrong ordering or even repeated components:
"1s1m1h" one match, should not match at all! (wrong order)
"11z50z" one match, should not match at all! (repeated components)
"1r1r1b" one match, should not match at all! (repeated components)
As for my last attempt, I've tried this variant of my first expression:
Match from begin ^ to the end $
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
It works better than the first version, but it still matches the empty string. It also requires me to tokenize the input first and pass each token to the expression separately, so that the begin (^) and end ($) anchors can match.
EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)
After reading about (and trying to understand) the regex lookahead concept, and with the help of Casimir et Hippolyte's answer, I tried the suggested expression:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following test string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
And the results were amazing! It detects complete valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h", which weren't meant to match at all). Unfortunately, it captures an empty match every time an invalid token is found, so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688". So I modified the lookahead condition to get the expression below:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
It gets rid of the empty matches in any string that doesn't match (?:\d+[ABC]){1,3}, so the empty matches just before "adfank" and "12322134445688" are gone, but the ones just before "1s1m1h" and "1h1h1h" are still detected.
So the question is: is there any regular expression that matches a triplet of values in a given order, where every component is optional but at least one must be present, and that doesn't match empty strings?
The regex library I'm using is the C++11 one (std::regex).

Yes, you can add a lookahead at the beginning to ensure there is at least one character:
^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
If you need to find this kind of substring in a larger string (that is, without tokenizing first), you can remove the anchors and use a more explicit subpattern in the lookahead:
(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)
In this case, to avoid false positives (since you are looking for very small strings that can be part of something else), you can add word boundaries to the pattern:
\b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Note: in a comma-delimited string, (?=\d+[ABC]) can be replaced by (?=[^,])

I think this might do the trick.
I key on either the beginning of the string (^) or the comma separator (,) to fix the start of each match: (?:^|,).
Example:
#include <regex>
#include <iostream>
#include <string>

// Group 1 is the triplet; (?:^|,) pins each match to the start or to a comma.
const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");

int main()
{
    std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";
    std::sregex_iterator iter(test.begin(), test.end(), r);
    std::sregex_iterator end_iter;
    for(; iter != end_iter; ++iter)
        std::cout << iter->str(1) << '\n'; // print only the captured triplet
}
Output:
1x2y3z
80r160g255b
48h30m50s
1x3z
255b
Is that what you are after?
EDIT:
If you really want to go to town and make empty expressions unmatched then as far as I can tell you have to put in every permutation like this:
const std::string A = "(?:\\d+[xrh])";
const std::string B = "(?:\\d+[ygm])";
const std::string C = "(?:\\d+[zbs])";
const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");
