Positive lookahead + overlapping matches regex - regex

I'm looking for a regex to match all % that are not followed by a valid 2-characters hex code (2 characters in a-fA-F0-9). I came up with (%)(?=([0-9a-fA-F][^0-9a-fA-F]|[^0-9a-fA-F])) which works well but is not supported in golang, because of the positive lookahead (?=).
How can I translate it (or maybe make it simpler?), so that it works with go?
For example, given the string %d%2524e%25f%255E00%%%252611%25, it should match the first % and the first two ones of the %%% substring.
ie: https://regex101.com/r/y0YQ1I/2

I only tried this on regex101 (marked golang regex), but it seems that it works as expected:
%[0-9a-fA-F][0-9a-fA-F]|(%)
or simpler:
%[0-9a-fA-F]{2}|(%)

The real challenge here is that the matches at position 19 and 20 are overlapping, which means we can't use any of the go builtin "FindAll..." functions since they only find non-overlapping matches. This means that we've got to match the regex repeatedly against substrings starting after subsequent match indices if we want to find them all.
For the regex itself I've used a non-capturing group (?:...) instead of a lookahead assertion. Additionally, the regex will also match percent-signs at the end of the string, since they cannot be followed by two hex digits:
func findPlainPercentIndices(s string) []int {
re := regexp.MustCompile(`%(?:[[:xdigit:]][[:^xdigit:]]|[[:^xdigit:]]|$)`)
indices := []int{}
idx := 0
for {
m := re.FindStringIndex(s[idx:])
if m == nil {
break
}
nextidx := idx + m[0]
indices = append(indices, nextidx)
idx = nextidx + 1
}
return indices
}
func main() {
str := "%d%2524e%25f%255E00%%%252611%25%%"
// 012345678901234567890123456789012
// 0 1 2 3
fmt.Printf("OK: %#v\n", findPlainPercentIndices(str))
// OK: []int{0, 19, 20, 31, 32}
}

Related

Go regex find substring

I have a string:
s := "root 1 12345 /root/pathtomyfolder/jdk/jdk.1.8.0.25 org.catalina.startup"
I need to grep the version number to a string
Tried,
var re = regexp.MustCompile(`jdk.*`)
func main() {
matches := re.FindStringSubmatch(s)
fmt.Printf ("%q", matches)
}
You need to specify capturing groups to extract submatches, as described in the package overview:
If 'Submatch' is present, the return value is a slice identifying the
successive submatches of the expression. Submatches are matches of
parenthesized subexpressions (also known as capturing groups) within
the regular expression, numbered from left to right in order of
opening parenthesis. Submatch 0 is the match of the entire expression,
submatch 1 the match of the first parenthesized subexpression, and so
on.
Something along the lines of the following:
func main() {
var re = regexp.MustCompile(`jdk\.([^ ]+)`)
s := "root 1 12345 /root/pathtomyfolder/jdk/jdk.1.8.0.25 org.catalina.startup"
matches := re.FindStringSubmatch(s)
fmt.Printf("%s", matches[1])
// Prints: 1.8.0.25
}
You'll of course want to check whether there actually is a submatch, or matches[1] will panic.
if you need the value in a string variable to be used / manipulated elsewhere, you could add the following in the given example Marc gave:
a := fmt.Sprintf("%s", matches[1])

Exclude quantitizer from regular expression`

I have a quantifier regular expression that matches a 5digit code [0-9]{5}.
How can I exclude any matched of the above quantifier?
I tried [^([0-9]{5})] but it seems it doesn't work.
Test data follows:
including:
12345678875645 (will be matched)
pppppaaaaa (will be matched)
52p26 (will be matched)
123 (will be matched)
excluding:
12345 (won't be matched)
try this
^(\d{1,4}|\d{6,})$
This won't match numbers with exactly 5 digits
demo here: https://regex101.com/r/sHvRMA/1
You can use a negative look ahead:
/(?!^[0-9]{5}$)^.+$/
var rexp = /(?!^[0-9]{5}$)^.+$/;
var str = ['12345', '12345678875645', 'pppppaaaaa', '52p26', '123'];
for (var i = 0; i < str.length; i++) {
console.log(str[i] + ' - ' + (rexp.test(str[i]) ? 'matched' : 'did not match'));
}
I assume that you need a regex to match all things except 5 digits length
You simply need to use negative lookahead assertion for excluding 5 digits. that is it.
\b(?!\d{5}).+|.{6,}\b
It excludes only 5 digits not anything else

Return a defined range of characters in Go

Let's say we have a converted float to a string:
"24.22334455667"
I want to just return 6 of the digits on the right of the decimal
I can get all digits, after the decimal this way:
re2 := regexp.MustCompile(`[!.]([\d]+)$`)
But I want only the first 6 digits after the decimal but this returns nothing:
re2 := regexp.MustCompile(`[!.]([\d]{1,6})$`)
How can I do this? I could not find an example of using [\d]{1,6}
Thanks
Alternatively...
func DecimalPlaces(decimalStr string, places int) string {
location := strings.Index(decimalStr, ".")
if location == -1 {
return ""
}
return decimalStr[location+1 : min(location+1+places, len(decimalStr))]
}
Where min is just a simple function to find the minimum of two integers.
Regular expressions seem a bit heavyweight for this sort of simple string manipulation.
Playground
You must remove the end of the line anchor $ since it won't be a line end after exactly 6 digits. For to capture exactly 6 digits, the quantifier must be
re2 := regexp.MustCompile(`[!.](\d{6})`)
Note that, this would also the digits which exists next to !. If you don't want this behaviour, you must remove the ! from the charcater class like
re2 := regexp.MustCompile(`[.](\d{6})`)
or
For to capture digits ranges from 1 to 6,
re2 := regexp.MustCompile(`[!.](\d{1,6})`)

regex with all components optionals, how to avoid empty matches

I have to process a comma separated string which contains triplets of values and translate them to runtime types,the input looks like:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each substring should be transformed this way:
"1x2y3z" should become Vector3 with x = 1, y = 2, z = 3
"80r160g255b" should become Color with r = 80, g = 160, b = 255
"48h30m50s" should become Time with h = 48, m = 30, s = 50
The problem I'm facing is that all the components are optional (but they preserve order) so the following strings are also valid Vector3, Color and Time values:
"1x3z" Vector3 x = 1, y = 0, z = 3
"255b" Color r = 0, g = 0, b = 255
"1h" Time h = 1, m = 0, s = 0
What I have tried so far?
All components optional
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
The A, B and C are replaced with the correct letter for each case, the expression works almost well but it gives twice the expected results (one match for the string and another match for an empty string just after the first match), for example:
"1h1m1s" two matches [1]: "1h1m1s" [2]: ""
"11x50z" two matches [1]: "11x50z" [2]: ""
"11111h" two matches [1]: "11111h" [2]: ""
This isn't unexpected... after all an empty string matches the expression when ALL of the components are empty; so in order to fix this issue I've tried the following:
1 to 3 quantifier
((?:\d+[ABC]){1,3})
But now, the expression matches strings with wrong ordering or even repeated components!:
"1s1m1h" one match, should not match at all! (wrong order)
"11z50z" one match, should not match at all! (repeated components)
"1r1r1b" one match, should not match at all! (repeated components)
As for my last attempt, I've tried this variant of my first expression:
Match from begin ^ to the end $
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
And it works better than the first version but it still matches the empty string plus I should first tokenize the input and then pass each token to the expression in order to assure that the test string could match the begin (^) and end ($) operators.
EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)
After reading and (try to) understanding the regex lookahead concept and with the help of Casimir et Hippolyte answer I've tried the suggested expression:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following test string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
And the results were amazing! it is able to detect complete valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h" which weren't intended to be matched at all). Unfortunately it captures emtpy matches everytime a unvalid match is found so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688", so I modified the Lookahead condition to get the expression below:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
It gets rid of the empty matches in any string which doesn't match (?:\d+[ABC]){1,3}) so the empty matches just before "adfank" and "12322134445688" are gone but the ones just before "1s1m1h", "1h1h1h" are stil detected.
So the question is: Is there any regular expression which matches three triplet values in a given order where all component is optional but should be composed of at least one component and doesn't match empty strings?
The regex tool I'm using is the C++11 one.
Yes, you can add a lookahead at the begining to ensure there is at least one character:
^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
If you need to find this kind of substring in a larger string (so without to tokenize before), you can remove the anchors and use a more explicit subpattern in a lookahead:
(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)
In this case, to avoid false positive (since you are looking for very small strings that can be a part of something else), you can add word-boundaries to the pattern:
\b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Note: in a comma delimited string: (?=\d+[ABC]) can be replaced by (?=[^,])
I think this might do the trick.
I am keying on either the beginning of the string to match ^ or the comma separator , for fix the start of each match: (?:^|,).
Example:
#include <regex>
#include <iostream>
const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");
int main()
{
std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";
std::sregex_iterator iter(test.begin(), test.end(), r);
std::sregex_iterator end_iter;
for(; iter != end_iter; ++iter)
std::cout << iter->str(1) << '\n';
}
Output:
1x2y3z
80r160g255b
48h30m50s
1x3z
255b
Is that what you are after?
EDIT:
If you really want to go to town and make empty expressions unmatched then as far as I can tell you have to put in every permutation like this:
const std::string A = "(?:\\d+[xrh])";
const std::string B = "(?:\\d+[ygm])";
const std::string C = "(?:\\d+[zbs])";
const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");

use regular expression to find and replace but only every 3 characters for DNA sequence

Is it possible to do a find/replace using regular expressions on a string of dna such that it only considers every 3 characters (a codon of dna) at a time.
for example I would like the regular expression to see this:
dna="AAACCCTTTGGG"
as this:
AAA CCC TTT GGG
If I use the regular expressions right now and the expression was
Regex.Replace(dna,"ACC","AAA") it would find a match, but in this case of looking at 3 characters at a time there would be no match.
Is this possible?
Why use a regex? Try this instead, which is probably more efficient to boot:
public string DnaReplaceCodon(string input, string match, string replace) {
if (match.Length != 3 || replace.Length != 3)
throw new ArgumentOutOfRangeException();
var output = new StringBuilder(input.Length);
int i = 0;
while (i + 2 < input.Length) {
if (input[i] == match[0] && input[i+1] == match[1] && input[i+2] == match[2]) {
output.Append(replace);
} else {
output.Append(input[i]);
output.Append(input[i]+1);
output.Append(input[i]+2);
}
i += 3;
}
// pick up trailing letters.
while (i < input.Length) output.Append(input[i]);
return output.ToString();
}
Solution
It is possible to do this with regex. Assuming the input is valid (contains only A, T, G, C):
Regex.Replace(input, #"\G((?:.{3})*?)" + codon, "$1" + replacement);
DEMO
If the input is not guaranteed to be valid, you can just do a check with the regex ^[ATCG]*$ (allow non-multiple of 3) or ^([ATCG]{3})*$ (sequence must be multiple of 3). It doesn't make sense to operate on invalid input anyway.
Explanation
The construction above works for any codon. For the sake of explanation, let the codon be AAA. The regex will be \G((?:.{3})*?)AAA.
The whole regex actually matches the shortest substring that ends with the codon to be replaced.
\G # Must be at beginning of the string, or where last match left off
((?:.{3})*?) # Match any number of codon, lazily. The text is also captured.
AAA # The codon we want to replace
We make sure the matches only starts from positions whose index is multiple of 3 with:
\G which asserts that the match starts from where the previous match left off (or the beginning of the string)
And the fact that the pattern ((?:.{3})*?)AAA can only match a sequence whose length is multiple of 3.
Due to the lazy quantifier, we can be sure that in each match, the part before the codon to be replaced (matched by ((?:.{3})*?) part) does not contain the codon.
In the replacement, we put back the part before the codon (which is captured in capturing group 1 and can be referred to with $1), follows by the replacement codon.
NOTE
As explained in the comment, the following is not a good solution! I leave it in so that others will not fall for the same mistake
You can usually find out where a match starts and ends via m.start() and m.end(). If m.start() % 3 == 0 you found a relevant match.