Looping over brackets with regex

My regex extracts 99% of the desired result.
This is my line:
Customer Service Representative (CS) (TM PM *) **
*Can have more parameters. Example: (TM PM TR) etc.
**Can have more parentheses. Example: (TM PM) (RI) (AB CD) etc.
Except for the first bracket (CS in this case), which is group 1, I can have any number of parentheses and any number of parameters within those parentheses in group 2.
My attempt yields the desired result, but with brackets:
(\(.*?\))\s*(\(.*?\).*)
My result:
My desired result:
group 1 : CS
group 2 : if gg yiy rt jfjfj jhfjh uigtu
I want help removing those parentheses from the result.
My attempt:
\((.*?)\)\s*\((.*?\).*)
which gives me
Can someone help me with this? I need to remove all the brackets from group 2 as well. I have been at it for a long time but can't figure out a way. Thank you.

You can't match disjoint sections of text using a single match operation. Because the number of bracketed groups varies, a replace approach with a fixed set of capturing groups won't work either.
You need a post-processing step to remove ( and ) from the Group 2 value.
So, after you get your matches with the current approach, remove all ( and ) from the Group 2 value with
Group2value = Group2value.Replace("(", "").Replace(")", "");
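For readers not working in C#, the same match-then-clean idea can be sketched in Python. This is only an illustration: the function name split_groups is made up for this example, and the pattern is the one from the question's second attempt.

```python
import re

# Same pattern as the question's second attempt: group 1 is the content of
# the first bracket, group 2 is everything from the second bracket onward.
PATTERN = re.compile(r"\((.*?)\)\s*(\(.*?\).*)")

def split_groups(line):
    """Return (group1, group2) with all brackets stripped from group 2.

    split_groups is a hypothetical helper name, not part of any library.
    """
    match = PATTERN.search(line)
    if match is None:
        return None
    group1 = match.group(1)
    # Post-processing step: drop every ( and ) left in group 2.
    group2 = match.group(2).replace("(", "").replace(")", "")
    return group1, group2

print(split_groups("(CS) (if gg yiy rt) (jfjfj) (jhfjh uigtu)"))
# ('CS', 'if gg yiy rt jfjfj jhfjh uigtu')
```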

Here is one approach which uses string splitting along with the base string functions:
string input = "(CS) (if gg yiy rt) (jfjfj) (jhfjh uigtu)";
string[] parts = Regex.Split(input, "\\) \\(");
string grp1 = parts[0].Replace("(", "");
parts[0] = "";
parts[parts.Length - 1] = parts[parts.Length - 1].Replace(")", "");
string grp2 = string.Join(" ", parts).Trim();
Console.WriteLine(grp1);
Console.WriteLine(grp2);
CS
if gg yiy rt jfjfj jhfjh uigtu

Related

How do I replace the nth occurrence of a special character, say, a pipe delimiter with another in Scala?

I'm new to Spark using Scala and I need to replace every nth occurrence of the delimiter with the newline character.
So far, I have been successful at entering a new line after the pipe delimiter.
I'm unable to replace the delimiter itself.
My input string is
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
println(txt.replaceAll(".\\|", "$0\n"))
The above statement generates the following output.
January|
February|
March|
April|
May|
June|
July|
August|
September|
October|
November|
December
I referred to the suggestion at https://salesforce.stackexchange.com/questions/189923/adding-comma-separator-for-every-nth-character but when I enter the number in the curly braces, I only end up adding the newline after 2 characters after the delimiter.
I'm expecting my output to be as given below.
January|February
March|April
May|June
July|August
September|October
November|December
How do I change my regular expression to get the desired output?
Update:
My friend suggested I try the following statement
println(txt.replaceAll("(.*?\\|){2}", "$0\n"))
and this produced the following output
January|February|
March|April|
May|June|
July|August|
September|October|
November|December
Now I just need to get rid of the pipe symbol at the end of each line.
You want to move the 2nd bar | outside of the capture group.
txt.replaceAll("([^|]+\\|[^|]+)\\|", "$1\n")
//val res0: String =
// January|February
// March|April
// May|June
// July|August
// September|October
// November|December
Regex Explained (regex is not Scala)
( - start a capture group
[^|] - any character as long as it's not the bar | character
[^|]+ - 1 or more of those (any) non-bar chars
\\| - followed by a single bar char |
[^|]+ - followed by 1 or more of any non-bar chars
) - close the capture group
\\| - followed by a single bar char (not in capture group)
"$1\n" - replace the entire matching string with just the first $1 capture group ($0 is the entire matching string) followed by the newline char
UPDATE
For the general case of N repetitions, regex becomes a bit more cumbersome, at least if you're trying to do it with a single regex formula.
The simplest thing to do (not the most efficient but simple to code) is to traverse the String twice.
val n = 5
txt.replaceAll(s"(\\w+\\|){$n}", "$0\n")
.replaceAll("\\|\n", "\n")
//val res0: String =
// January|February|March|April|May
// June|July|August|September|October
// November|December
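If a single pass is preferred, one option is to keep the first N items inside the capture group and leave only the Nth bar outside it. The sketch below shows that idea in Python rather than Scala; the function name break_every_nth and its parameters are assumptions made for this illustration, and it assumes the separator is a single character and no item is empty.

```python
import re

def break_every_nth(text, n, sep="|"):
    """Replace every nth separator with a newline in one regex pass.

    Hypothetical helper: assumes sep is one character and items are non-empty.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    esc = re.escape(sep)
    item = "[^" + esc + "]+"  # one or more non-separator characters
    # (n - 1) "item + separator" pairs plus one more item are captured;
    # the nth separator is matched outside the group and so gets dropped.
    pattern = "((?:" + item + esc + "){" + str(n - 1) + "}" + item + ")" + esc
    return re.sub(pattern, lambda m: m.group(1) + "\n", text)

txt = "January|February|March|April|May|June|July|August|September|October|November|December"
print(break_every_nth(txt, 5))
# January|February|March|April|May
# June|July|August|September|October
# November|December
```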
You could first split the string on '|' to get an array of strings and then regroup the pieces to build the required output.
val txt = "January|February|March|April|May|June|July|August|September|October|November|December"
// split on the literal bar, regroup the pieces two at a time, and rejoin
val finalout = txt.split("\\|")
  .grouped(2)            // iterator of arrays with up to 2 elements each
  .map(_.mkString("|"))  // rejoin each pair with a bar
  .mkString("\n")        // one pair per line
println(finalout)

How to find any non-digit characters using RegEx in ABAP

I need a Regular Expression to check whether a value contains any other characters than digits between 0 and 9.
I also want to check the length of the value.
The RegEx I've made: ^([0-9]\d{6})$
My test values are: 123Z45 and 123456
The ABAP code:
FIND ALL OCCURRENCES OF REGEX '^([0-9]\d{6})$' IN L_VALUE RESULTS DATA(LT_RESULTS).
I'm expecting a result in LT_RESULTS when I test the first value, '123Z45', because it contains a non-digit character.
But LT_RESULTS is empty in nearly every test case.
Your expression ^([0-9]\d{6})$ translates to:
^ - start of input
( - begin capture group
[0-9] - a character between 0 and 9
\d{6} - six digits (digit = character between 0 and 9)
) - end capture group
$ - end of input
So it will only match 1234567 (7 digit strings), not 123456, or 123Z45.
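If you just need to find a string that contains non-digits, you could use the following instead: ^\d*[^\d]+\d*$ (note that this only matches when the non-digits form a single run with digits on either side; to detect a non-digit anywhere, searching for [^\d] without the anchors is enough)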
* - previous element may occur zero, one or more times
[^\d] - ^ right after [ means "NOT", i.e. any character which is not a digit
+ - previous element may occur one or more times
Example:
const expression = /^\d*[^\d]+\d*$/;
const inputs = ['123Z45', '123456', 'abc', 'a21345', '1234f', '142345'];
console.log(inputs.filter(i => expression.test(i)));
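The two checks from the question (is there any non-digit, and is the value exactly six digits long) can also be kept separate. Here is a minimal Python sketch of that split; the function names are hypothetical, and the explicit [0-9] ranges are used so that only ASCII digits count.

```python
import re

def has_non_digit(value):
    """True if the value contains any character other than 0-9."""
    return re.search(r"[^0-9]", value) is not None

def is_six_digits(value):
    """True if the value consists of exactly six characters, all 0-9."""
    return re.fullmatch(r"[0-9]{6}", value) is not None

for value in ["123Z45", "123456"]:
    print(value, has_non_digit(value), is_six_digits(value))
# 123Z45 True False
# 123456 False True
```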
You can also use a POSIX character class if you want to extract a group of letters:
DATA(l_guid) = '0074162D8EAA549794A4EF38D9553990680B89A1'.
DATA(regx) = '[[:alpha:]]+'.
DATA(substr) = match( val = l_guid
regex = regx
occ = 1 ).
It finds the first group of alphabetic characters in the string and returns it.
If you just want to check whether such characters exist, or how many groups of them occur in your string, the count built-in function is your friend:
DATA(how_many) = count( val = l_guid regex = regx ).
DATA(yes) = boolc( count( val = l_guid regex = regx ) > 0 ).
Match and count exist since ABAP 7.50.
If you don't need a Regular Expression for something more complex, ABAP has some nice comparison operators CO (Contains Only), CA, NA etc for you. Something like:
IF L_VALUE CO '0123456789' AND STRLEN( L_VALUE ) = 6.

Filter a string using regular expression

I tried the following code. However, the result is not what I want.
$strLine = "100.11 Q9"
$sortString = StringRegExp ($strLine,'([0-9\.]{1,7})', $STR_REGEXPARRAYMATCH)
MsgBox(0, "", $sortString[0],2)
The output shows 100.11, but I want 100.11 9. How could I display it this way using a regular expression?
$sPattern = "([0-9\.]+)\sQ(\d+)"
$strLine = "100.11 Q9"
$sortString = StringRegExpReplace($strLine, $sPattern, '\1 \2')
MsgBox(0, "$sortString", $sortString, 2)
$strLine = "100.11 Q9"
$sortString = StringRegExp($strLine, $sPattern, 3); array of global matches.
For $i1 = 0 To UBound($sortString) -1
MsgBox(0, "$sortString[" & $i1 & "]", $sortString[$i1], 2)
Next
The pattern captures 2 groups: 100.11 and 9.
The 1st group matches any run of digits and dots, up to the \s, which matches the space. The pattern then matches the Q, and the 2nd group matches the digits that follow.
StringRegExpReplace replaces the whole match with the 1st and 2nd groups separated by a space.
StringRegExp returns the 2 groups as 2 array elements.
Choose whichever of the 2 approaches above you prefer.
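The same two options (replace the match with its groups, or pull the groups out as a list) look like this in Python. This is a sketch for comparison only; the function names are invented for the example, and the pattern is the one from the AutoIt answer.

```python
import re

# Group 1: digits and dots; then one whitespace character and a literal Q;
# group 2: the digits after the Q.
PATTERN = re.compile(r"([0-9.]+)\sQ(\d+)")

def join_groups(line):
    """Replace the matched text with 'group1 group2'."""
    return PATTERN.sub(r"\1 \2", line)

def extract_groups(line):
    """Return the two captured groups as a list, or [] if nothing matches."""
    match = PATTERN.search(line)
    return list(match.groups()) if match else []

print(join_groups("100.11 Q9"))     # 100.11 9
print(extract_groups("100.11 Q9"))  # ['100.11', '9']
```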

R regular expression issue

I have a dataframe column containing page paths:
pagePath
/text/other_text/123-some_other_txet-4571/text.html
/text/other_text/another_txet/15-some_other_txet.html
/text/other_text/25189-some_other_txet/45112-text.html
/text/other_text/text/text/5418874-some_other_txet.html
/text/other_text/text/text/some_other_txet-4157/text.html
What I want to do is extract the first number after a /, for example 123, from each row.
To solve this problem, I tried the following:
num = gsub("\\D", " ", mydata$pagePath)  # replace all non-digit characters with spaces
num1 = gsub("\\s+", " ", num)  # leave only one space between numbers
num2 = gsub("^\\s", "", num1)  # delete the leading space
my_number = gsub(" .*$", "", num2)  # keep only the first number in the string
I thought that was what I wanted, but I ran into trouble, especially with rows like the last row in the example: /text/other_text/text/text/some_other_txet-4157/text.html
So, what I really want is to extract the first number after a /.
Any help would be very welcome.
You can use the following regex with gsub:
"^(?:.*?/(\\d+))?.*$"
And replace with "\\1".
Code:
> s <- c("/text/other_text/123-some_other_txet-4571/text.html", "/text/other_text/another_txet/15-some_other_txet.html", "/text/other_text/25189-some_other_txet/45112-text.html", "/text/other_text/text/text/5418874-some_other_txet.html", "/text/other_text/text/text/some_other_txet-4157/text.html")
> gsub("^(?:.*?/(\\d+))?.*$", "\\1", s, perl=T)
[1] "123" "15" "25189" "5418874" ""
The regex will match optionally (with a (?:.*?/(\\d+))? subpattern) a part of string from the beginning till the first / (with .*?/) followed with 1 or more digits (capturing the digits into Group 1, with (\\d+)) and then the rest of the string up to its end (with .*$).
NOTE that perl=T is required.
With stringr's str_extract, your code and pattern can be shortened to:
> str_extract(s, "(?<=/)\\d+")
[1] "123" "15" "25189" "5418874" NA
The str_extract will extract the first 1 or more digits if they are preceded with a / (the / itself is not returned as part of the match since it is a lookbehind subpattern, a zero width assertion, that does not put the matched text into the result).
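For comparison, the same lookbehind pattern works outside R too. The sketch below applies it in Python to the five sample paths; the helper name first_number_after_slash is made up for this illustration.

```python
import re

def first_number_after_slash(path):
    """Return the first run of digits that directly follows a /, else None."""
    match = re.search(r"(?<=/)\d+", path)
    return match.group(0) if match else None

paths = [
    "/text/other_text/123-some_other_txet-4571/text.html",
    "/text/other_text/another_txet/15-some_other_txet.html",
    "/text/other_text/25189-some_other_txet/45112-text.html",
    "/text/other_text/text/text/5418874-some_other_txet.html",
    "/text/other_text/text/text/some_other_txet-4157/text.html",
]
print([first_number_after_slash(p) for p in paths])
# ['123', '15', '25189', '5418874', None]
```

As with str_extract, the last path yields no result because its only number follows a - rather than a /.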
Try this
\/(\d+).*
Output:
MATCH 1
1. [26-29] `123`
MATCH 2
1. [91-93] `15`
MATCH 3
1. [132-137] `25189`
MATCH 4
1. [197-204] `5418874`

Regular Expressions matching difficulty

My current regex:
([\d]*)([^\d]*[\d][a-z]*-[\d]*)([\d][a-z?])(.?)
I am trying to make a regex match a string made up of: a count (any number from 0 to 1 million), followed by a digit and sometimes a letter, then a -, then any number of digits, followed by the same digit and optional letter that came before the -, and then sometimes one more letter. Examples of strings it should match:
1921-1220104081741b
192123212a-1220234104081742ab
An example of what it should return for the strings above (these are 2 separate examples; it shouldn't read both lines at once):
(192) (1-122010408174) (1) (b)
(19212321) (2a-122023410408174) (2a) (b)
My current regex works with the second one, but it returns (1b) for the first, where I would like (1) (b). It should still return (2a) for the second one, and likewise in this case:
1926h-1220104081746h Should Return: (192) (6h-122010408174) (6h)
I'm not 100% sure whether this is possible, since I'm fairly new to regex. For reference, I'm doing this in Excel VBA, in case there is an easier way to do it there.
You could capture the character(s) before the dash character, and then back reference that match.
In the expression below, \3 would match what was matched by the 3rd capturing group:
(\d*)((\d[a-z]*)-\d*)(\3)([a-z])?
Output after merging the capture groups:
1921-1220104081741b
(192) (1-122010408174) (1) (b)
192123212a-1220234104081742ab
(19212321) (2a-122023410408174) (2a) (b)
1926h-1220104081746h
(192) (6h-122010408174) (6h)
Example:
Disregard the JS. Here is the output after merging the capture groups:
var strings = ['1921-1220104081741b', '192123212a-1220234104081742ab', '1926h-1220104081746h'],
    exp = /(\d*)((\d[a-z]*)-\d*)(\3)([a-z])?/;
strings.forEach(function(str) {
  var m = str.match(exp);
  console.log(str);
  // m[3] is nested inside m[2], so only groups 1, 2, 4 and 5 are printed
  console.log('(' + m[1] + ') (' + m[2] + ') (' + m[4] + ') (' + (m[5] || '') + ')');
  console.log('---');
});
I think what you are saying with "followed by the same number" is that the piece right before the dash is repeated as your third capture group. I would suggest implementing this by splitting up your second capture group and then using a backreference:
([\d]*)([\d][a-z]*)-([\d]*)(\2)(.?)
For your three examples:
1921-1220104081741b
192123212a-1220234104081742ab
1926h-1220104081746h
This results in:
(192) (1) - (122010408174) (1) (b)
(19212321) (2a) - (122023410408174) (2a) (b)
(192) (6h) - (122010408174) (6h) ()
...and you can join the two middle groups back together to get the hyphenated term you wanted.
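The split-and-rejoin idea from this answer can be sketched in Python as follows. The function name split_code and the shape of its return value are assumptions made for this illustration; the pattern is the one given above, with [\d] written as \d.

```python
import re

# Group 2 is the piece right before the dash; \2 requires that exact
# text to appear again after the middle run of digits.
PATTERN = re.compile(r"(\d*)(\d[a-z]*)-(\d*)(\2)(.?)")

def split_code(text):
    """Return (count, hyphenated term, repeated piece, trailing char)."""
    match = PATTERN.match(text)
    if match is None:
        return None
    count, key, middle, repeated, trailing = match.groups()
    # Join the two middle groups back into the hyphenated term.
    return count, key + "-" + middle, repeated, trailing

for s in ["1921-1220104081741b", "192123212a-1220234104081742ab", "1926h-1220104081746h"]:
    print(split_code(s))
# ('192', '1-122010408174', '1', 'b')
# ('19212321', '2a-122023410408174', '2a', 'b')
# ('192', '6h-122010408174', '6h', '')
```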