Regular Expressions matching difficulty - regex

My current regex:
([\d]*)([^\d]*[\d][a-z]*-[\d]*)([\d][a-z?])(.?)
I am trying to make a regex match a string built as follows: a count (any number from 0 to 1 million), followed by a digit, then sometimes a letter, then a hyphen, then any number of digits, followed by the same digit and optional letter that appeared before the hyphen, and then sometimes one more letter. Examples of strings it should match:
1921-1220104081741b
192123212a-1220234104081742ab
Examples of what it should return for the strings above (these are two separate examples; it shouldn't read both lines at once):
(192) (1-122010408174) (1) (b)
(19212321) (2a-122023410408174) (2a) (b)
My current regex works with the second one, but for the first it returns (1b) when I would like it to return (1) (b), while still returning (2a) for the second one, or in the case of:
1926h-1220104081746h should return: (192) (6h-122010408174) (6h)
I'm not 100% sure this is possible, since I'm fairly new to regex. For reference, I'm doing this in Excel VBA, in case there is an easier way to do it.

You could capture the character(s) before the dash character, and then back reference that match.
In the expression below, \3 would match what was matched by the 3rd capturing group:
(\d*)((\d[a-z]*)-\d*)(\3)([a-z])?
Output after merging the capture groups:
1921-1220104081741b
(192) (1-122010408174) (1) (b)
192123212a-1220234104081742ab
(19212321) (2a-122023410408174) (2a) (b)
1926h-1220104081746h
(192) (6h-122010408174) (6h)
Example (disregard the JS; it only prints the merged capture groups):
var strings = ['1921-1220104081741b', '192123212a-1220234104081742ab', '1926h-1220104081746h'],
    exp = /(\d*)((\d[a-z]*)-\d*)(\3)([a-z])?/;
strings.forEach(function (str) {
  var m = str.match(exp);
  console.log(str);
  // m[3] is only used for the backreference, so it is skipped in the output.
  console.log('(' + m[1] + ') (' + m[2] + ') (' + m[4] + ') (' + (m[5] || '') + ')');
  console.log('---');
});

I think what you are saying with "followed by the same number" is that the piece right before the dash is repeated as your third capture group. I would suggest implementing this by splitting up your second capture group and then using a backreference:
([\d]*)([\d][a-z]*)-([\d]*)(\2)(.?)
For your three examples:
1921-1220104081741b
192123212a-1220234104081742ab
1926h-1220104081746h
This results in:
(192) (1) - (122010408174) (1) (b)
(19212321) (2a) - (122023410408174) (2a) (b)
(192) (6h) - (122010408174) (6h) ()
...and you can join the two middle groups back together to get the hyphenated term you wanted.
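As a rough illustration of the split-and-rejoin idea above, here is a minimal Python sketch using the standard re module. The helper name split_code is made up for this example, and using fullmatch assumes each input is exactly one such string with nothing around it.

```python
import re

# Same pattern as above, with [\d] written as \d. \2 is the backreference
# to the "digit plus optional letters" piece captured before the hyphen.
PATTERN = re.compile(r"(\d*)(\d[a-z]*)-(\d*)(\2)(.?)")


def split_code(text):
    """Return (count, hyphenated term, repeated piece, trailing char), or None."""
    m = PATTERN.fullmatch(text)
    if m is None:
        return None
    count, piece, middle, repeated, tail = m.groups()
    # Join the two middle groups back into the hyphenated term.
    return count, piece + "-" + middle, repeated, tail


for s in ("1921-1220104081741b",
          "192123212a-1220234104081742ab",
          "1926h-1220104081746h"):
    print(s, "->", split_code(s))
```

The same pattern should also work in VBA's RegExp object, which supports backreferences, but that is not tested here.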


How to extract the operands on both sides of "==" using regex?

Language and package
python3.8, regex
Description
The inputs and wanted outputs are listed as following:
if (programWorkflowState.getTerminal(1, 2) == Boolean.TRUE) {
Want: programWorkflowState.getTerminal(1, 2) and Boolean.TRUE
boolean ignore = !_isInStatic.isEmpty() && (_isInStatic.peek() == 3) && isAnonymous;
Want: _isInStatic.peek() and 3
boolean b = (num1 * ( 2 + num2)) == value;
Want: (num1 * ( 2 + num2)) and value
My current regex
((?:\((?:[^\(\)]|(?R))*\)|[\w\.])+)\s*==\s*((?:\((?:[^\(\)]|(?R))*\)|[\w\.])+)
This pattern is meant to match \((?:[^\(\)]|(?R))*\) or [\w\.] on both sides of "==".
Problem: when tested on regex101.com, it fails to match the recursive part (num1 * ( 2 + num2)).
But if I use only the recursive pattern \((?:m|(?R))*\) on its own, it succeeds in matching (num1 * ( 2 + num2)).
What's the right regex to achieve my purpose?
The \((?:m|(?R))*\) pattern contains a (?R) construct (equal to (?0) subroutine) that recurses the entire pattern.
You need to wrap the pattern you need to recurse with a group and use a subroutine instead of (?R) recursion construct, e.g. (?P<aux>\((?:m|(?&aux))*\)) to recurse a pattern inside a longer one.
You can use
((?:(?P<aux1>\((?:[^()]++|(?&aux1))*\))|[\w.])++)\s*[!=]=\s*((?:(?&aux1)|[\w.])+)
In the regex demo, this takes just 6875 steps to match the string provided, whereas yours takes 13680.
Details
((?:(?P<aux1>\((?:[^()]++|(?&aux1))*\))|[\w.])++) - Group 1, matches one or more occurrences (possessively, due to ++, not allowing backtracking into the pattern so that the regex engine could not re-try matching a string in another way if the subsequent patterns fail to match)
(?P<aux1>\((?:[^()]++|(?&aux1))*\)) - an auxiliary group "aux1" that matches (, then zero or more occurrences of either 1+ chars other than ( and ) or the whole Group "aux1" pattern, and then a )
| - or
[\w.] - a letter, digit, underscore or .
\s*[!=]=\s* - != or == with zero or more whitespace on both ends
((?:(?&aux1)|[\w.])+) - Group 2: one or more occurrences of the Group "aux1" pattern or a letter, digit, underscore or .

RegEx for matching 3 alphabets and 1-2 digits

I am trying to write a regular expression to find matches in a text of at least 100 characters. A match is any substring that begins with at least 3 letters, followed by at least 1 and at most 2 digits.
Examples -
abcjkhklfdpdn24hjkk - In this case I want to extract pdn24
hjdksfkpdf1lkjk - In this case I want to extract pdf1
hjgjdkspdg34kjfs dhj khk678jkfhlds1 - In this case I want both pdg34 and lds1
How do I write a regex for this? The number of starting letters in a match is always 3, and the number of digits is either 1 or 2 (no more, no less).
This is what works if there are 2 digits after the 3 letter string.
[A-Za-z]{3}[0-9]{2}
But the length of the digits can vary between 1 and 2. How do I include the varying length in the regex?
We can wrap your original expression, with the digit count changed to {1,2}, in a capturing group, and then think about the left and right boundaries around it. For instance, on the right we might want to use \D:
([A-Za-z]{3}[0-9]{1,2})\D
This requires a non-digit character after the digits, so it will not match at the very end of the string. A stricter expression is possible, but this might just work.
Based on Cary Swoveland's advice, we can also use this expression, which is much better because the lookahead (?!\d) consumes nothing and also succeeds at the end of the string:
\p{L}{3}\d{1,2}(?!\d)
Test
re = /([A-Za-z]{3}[0-9]{1,2})\D/m
str = 'abcjkhklfdpdn24hjkk
hjdksfkpdf1lkjk
hjgjdkspdg34kjfs dhj khk678jkfhlds1 '
# Print the match result
str.scan(re) do |match|
  puts match.to_s
end
This script shows how the capturing group works:
const regex = /([A-Za-z]{3}[0-9]{1,2})\D/gm;
const str = `abcjkhklfdpdn24hjkk
hjdksfkpdf1lkjk
hjgjdkspdg34kjfs dhj khk678jkfhlds1 `;
let m;
while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }
  // The result can be accessed through the `m`-variable.
  m.forEach((match, groupIndex) => {
    console.log(`Found match, group ${groupIndex}: ${match}`);
  });
}
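For comparison, the lookahead variant can be checked with a short Python sketch. This is only an illustration: Python's standard re module does not support \p{L}, so the sketch assumes ASCII letters and uses [A-Za-z] instead, and the helper name find_codes is made up here.

```python
import re

# Exactly 3 letters, then 1 or 2 digits, not followed by another digit.
# The negative lookahead consumes nothing, so a match at the very end
# of the string is still found.
PATTERN = re.compile(r"[A-Za-z]{3}[0-9]{1,2}(?!\d)")


def find_codes(text):
    """Return every 3-letter + 1-2 digit substring found in text."""
    return PATTERN.findall(text)


for line in ("abcjkhklfdpdn24hjkk",
             "hjdksfkpdf1lkjk",
             "hjgjdkspdg34kjfs dhj khk678jkfhlds1"):
    print(find_codes(line))
```

Note that khk678 yields no match, because its letters are followed by three digits.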
At least 3 letters: [a-zA-Z]{3,}
1 or 2 digits (not more not less): [0-9]{1,2}
This gives us:
/[a-zA-Z]{3,}[0-9]{1,2}/

Looping over brackets with regex

Regex extracting 99% of desired result.
This is my line:
Customer Service Representative (CS) (TM PM *) **
*Can have more parameters. Example (TM PM TR) etc
**Can have more parenthesis. Example (TM PM) (RI) (AB CD) etc
Except for the first bracket (CS in this case) which is group 1, I can have any number of parenthesis and any number of parameters within those parenthesis in group 2.
My attempt yields the desired result, but with brackets
(\(.*?\))\s*(\(.*?\).*)
My result:
My desired result:
group 1 : CS
group 2 : if gg yiy rt jfjfj jhfjh uigtu
I want help on removing those parenthesis from the result.
My attempt:
\((.*?)\)\s*\((.*?\).*)
which gives me
Can someone help me with this? I need to remove all the brackets from group 2 as well. I have been at it for a long time but can't figure out a way. Thank you.
You can't match disjoint sections of text using a single match operation. When you need to repeat a group, there is no way to even use a replace approach with capturing groups.
You need a post-process step to remove ( and ) from Group 2 value.
So, after you get your matches with the current approach, remove all ( and ) from the Group 2 value with
Group2value = Group2value.Replace("(", "").Replace(")", "");
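The same match-then-clean-up idea can be sketched in Python. This is a minimal illustration only: it reuses the second pattern from the question, the helper name extract_groups is hypothetical, and the sample line is built from the examples in the question.

```python
import re

# Group 1: contents of the first bracket pair.
# Group 2: everything from the second bracket pair onward, brackets included.
PATTERN = re.compile(r"\((.*?)\)\s*(\(.*?\).*)")


def extract_groups(line):
    """Return (group1, group2 without parentheses), or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    group1 = m.group(1)
    # Post-process step: strip the remaining parentheses from group 2.
    group2 = m.group(2).replace("(", "").replace(")", "")
    return group1, group2


print(extract_groups("Customer Service Representative (CS) (TM PM) (RI) (AB CD)"))
```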
Here is one approach which uses string splitting along with the base string functions:
string input = "(CS) (if gg yiy rt) (jfjfj) (jhfjh uigtu)";
string[] parts = Regex.Split(input, "\\) \\(");
string grp1 = parts[0].Replace("(", "");
parts[0] = "";
parts[parts.Length - 1] = parts[parts.Length - 1].Replace(")", "");
string grp2 = string.Join(" ", parts).Trim();
Console.WriteLine(grp1);
Console.WriteLine(grp2);
Output:
CS
if gg yiy rt jfjfj jhfjh uigtu

RegEx for matching the first {N} chars and last {M} chars

I'm having an issue filtering tags in Grafana with an InfluxDB backend. I'm trying to filter out the first 8 characters and last 2 of the tag but I'm running into a really weird issue.
Here are some of the names...
GYPSKSVLMP2L1HBS135WH
GYPSKSVLMP2L2HBS135WH
RSHLKSVLMP1L1HBS045RD
RSHLKSVLMP35L1HBS135WH
RSHLKSVLMP35L2HBS135WH
I only want to return something like this:
MP8L1HBS225
MP24L2HBS045
I first started off using this expression:
[MP].*
But it only returns the following out of 148:
PAYNKSVLMP27L1HBS045RD
PAYNKSVLMP27L1HBS135WH
PAYNKSVLMP27L1HBS225BL
PAYNKSVLMP27L1HBS315BR
The pattern [MP].* matches a single M or P (it is a character class, not the literal text MP) and then matches any characters until the end of the string, without taking into account any character, digit, or quantified number afterwards.
If you want to match from MP, and the value does not end in a digit but the last character of the match should be a digit, you could use:
MP[A-Z0-9]+[0-9]
If lookaheads are supported you might also use:
MP[A-Z0-9]+(?=[A-Z0-9]{2}$)
You may not even need to touch MP. You can simply define a left and a right boundary, just as your question asks, and capture everything in between, which might be faster, with an expression similar to:
(\w{8})(.*)(\w{2})
You can then refer to the part you want as $2, the second capturing group, which makes it easy to use in a replacement.
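To see what the second capturing group holds, here is a small Python sketch using the tag names from the question. It is only an illustration of the pattern, not of how Grafana or InfluxDB evaluate it; the helper name strip_ends is made up, and fullmatch is used on the assumption that the whole tag should be consumed.

```python
import re

# Group 1: first 8 word characters, group 3: last 2, group 2: the rest.
PATTERN = re.compile(r"(\w{8})(.*)(\w{2})")


def strip_ends(tag):
    """Return the tag without its first 8 and last 2 characters, or None."""
    m = PATTERN.fullmatch(tag)
    return m.group(2) if m else None


for tag in ("GYPSKSVLMP2L1HBS135WH",
            "RSHLKSVLMP1L1HBS045RD",
            "RSHLKSVLMP35L2HBS135WH"):
    print(tag, "->", strip_ends(tag))
```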
Performance
This JavaScript snippet shows the performance of this expression using a simple 1-million times for loop.
var repeat = 1000000;
var start = Date.now();
var match;
// Count down from repeat to 1 so the loop body runs exactly `repeat` times.
for (var i = repeat; i > 0; i--) {
  var string = "RSHLKSVLMP35L2HBS135WH";
  var regex = /^(\w{8})(.*)(\w{2})$/g;
  match = string.replace(regex, "$2");
}
var end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");
Try Regex: (?<=\w{8})\w+(?=\w{2})

use regular expression to find and replace but only every 3 characters for DNA sequence

Is it possible to do a find/replace using regular expressions on a string of DNA such that it only considers 3 characters (one codon of DNA) at a time?
for example I would like the regular expression to see this:
dna="AAACCCTTTGGG"
as this:
AAA CCC TTT GGG
If I use a regular expression as it stands, for example
Regex.Replace(dna,"ACC","AAA"), it would find a match, but when looking at 3 characters at a time there should be no match.
Is this possible?
Why use a regex? Try this instead, which is probably more efficient to boot:
public string DnaReplaceCodon(string input, string match, string replace) {
    if (match.Length != 3 || replace.Length != 3)
        throw new ArgumentOutOfRangeException();
    var output = new StringBuilder(input.Length);
    int i = 0;
    while (i + 2 < input.Length) {
        if (input[i] == match[0] && input[i+1] == match[1] && input[i+2] == match[2]) {
            output.Append(replace);
        } else {
            // Copy the three characters of a non-matching codon unchanged.
            output.Append(input[i]);
            output.Append(input[i+1]);
            output.Append(input[i+2]);
        }
        i += 3;
    }
    // Pick up trailing letters (advance i so the loop terminates).
    while (i < input.Length) output.Append(input[i++]);
    return output.ToString();
}
Solution
It is possible to do this with regex. Assuming the input is valid (contains only A, T, G, C):
Regex.Replace(input, @"\G((?:.{3})*?)" + codon, "$1" + replacement);
If the input is not guaranteed to be valid, you can just do a check with the regex ^[ATCG]*$ (allow non-multiple of 3) or ^([ATCG]{3})*$ (sequence must be multiple of 3). It doesn't make sense to operate on invalid input anyway.
Explanation
The construction above works for any codon. For the sake of explanation, let the codon be AAA. The regex will be \G((?:.{3})*?)AAA.
The whole regex actually matches the shortest substring that ends with the codon to be replaced.
\G            # Must be at the beginning of the string, or where the last match left off
((?:.{3})*?)  # Match any number of codons, lazily. The text is also captured.
AAA           # The codon we want to replace
We make sure that matches only start at positions whose index is a multiple of 3 with:
\G, which asserts that the match starts from where the previous match left off (or the beginning of the string)
And the fact that the pattern ((?:.{3})*?)AAA can only match a sequence whose length is a multiple of 3.
Due to the lazy quantifier, we can be sure that in each match, the part before the codon to be replaced (matched by the ((?:.{3})*?) part) does not contain the codon.
In the replacement, we put back the part before the codon (which is captured in capturing group 1 and can be referred to with $1), followed by the replacement codon.
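Python's standard re module has no \G anchor, so the construction above cannot be copied over directly. As a hedged alternative, the sketch below swaps in a different technique: it matches every consecutive 3-character chunk with .{3} and decides per chunk whether to replace it. Because re.sub scans left to right without overlap, each chunk starts at an index that is a multiple of 3. The function name replace_codon is made up for this example, and the input is assumed to be valid (no newlines).

```python
import re


def replace_codon(dna, codon, replacement):
    """Replace codon with replacement, looking only at frame-aligned triplets."""
    if len(codon) != 3 or len(replacement) != 3:
        raise ValueError("codon and replacement must be 3 characters long")

    def swap(m):
        # m.group(0) is one full triplet; replace it only on an exact match.
        return replacement if m.group(0) == codon else m.group(0)

    # Trailing characters that do not form a full triplet are left untouched.
    return re.sub(r".{3}", swap, dna)


print(replace_codon("AAACCCTTTGGG", "ACC", "AAA"))  # no aligned ACC codon
print(replace_codon("AAACCCTTTGGG", "CCC", "GGG"))
```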
NOTE
As explained in the comment, the following is not a good solution! I leave it in so that others will not fall for the same mistake.
You can usually find out where a match starts and ends via m.start() and m.end(). If m.start() % 3 == 0 you found a relevant match.