Go regex find substring - regex

I have a string:
s := "root 1 12345 /root/pathtomyfolder/jdk/jdk.1.8.0.25 org.catalina.startup"
I need to extract the version number into a string.
I tried:
var re = regexp.MustCompile(`jdk.*`)
func main() {
matches := re.FindStringSubmatch(s)
fmt.Printf ("%q", matches)
}

You need to specify capturing groups to extract submatches, as described in the package overview:
If 'Submatch' is present, the return value is a slice identifying the
successive submatches of the expression. Submatches are matches of
parenthesized subexpressions (also known as capturing groups) within
the regular expression, numbered from left to right in order of
opening parenthesis. Submatch 0 is the match of the entire expression,
submatch 1 the match of the first parenthesized subexpression, and so
on.
Something along the lines of the following:
func main() {
var re = regexp.MustCompile(`jdk\.([^ ]+)`)
s := "root 1 12345 /root/pathtomyfolder/jdk/jdk.1.8.0.25 org.catalina.startup"
matches := re.FindStringSubmatch(s)
fmt.Printf("%s", matches[1])
// Prints: 1.8.0.25
}
You'll of course want to check whether there actually is a submatch, or matches[1] will panic.
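A minimal sketch of that check, assuming you want a small helper around the same pattern. The name extractJDKVersion is hypothetical and not part of the original answer. FindStringSubmatch returns nil when nothing matches, so testing for nil before indexing avoids the panic:

```go
package main

import (
	"fmt"
	"regexp"
)

// jdkVersionRe captures everything after "jdk." up to the next space.
var jdkVersionRe = regexp.MustCompile(`jdk\.([^ ]+)`)

// extractJDKVersion (hypothetical helper name) returns the captured version
// and true, or "" and false when the pattern does not match.
func extractJDKVersion(s string) (string, bool) {
	matches := jdkVersionRe.FindStringSubmatch(s)
	if matches == nil {
		// No match at all, so indexing matches[1] would panic.
		return "", false
	}
	return matches[1], true
}

func main() {
	s := "root 1 12345 /root/pathtomyfolder/jdk/jdk.1.8.0.25 org.catalina.startup"
	if v, ok := extractJDKVersion(s); ok {
		fmt.Println(v) // 1.8.0.25
	} else {
		fmt.Println("no version found")
	}
}
```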

If you need the value in a string variable to use or manipulate elsewhere, you could add the following to the example Marc gave (matches[1] is already a string, so a plain assignment a := matches[1] works too):
a := fmt.Sprintf("%s", matches[1])

Related

regex match longest substring with equal first and last char

/(\w)(\w*)\1/
For this string:"mgntdygtxrvxjnwksqhxuxtrv" I match "txrvxjnwksqhxuxt" (using Ruby), but not the even longer valid substring "tdygtxrvxjnwksqhxuxt".
For a given string, here are two ways to find the longest substring that begins and ends with the same character.
Suppose
str = "mgntdygtxrvxjnwksqhxuxtrv"
Use a regular expression
r = /(.)(?=(.*\1))/
str.gsub(r).map { $1 + $2 }.max_by(&:length)
#=> "tdygtxrvxjnwksqhxuxt".
When, as here, the regular expression contains capture groups, it may be more convenient to use String#gsub without a second argument or block (in which case it returns an enumerator, which can be chained) than String#scan (" If the pattern contains groups, each individual result is itself an array containing one entry per group.") Here gsub performs no substitutions; it merely generates matches of the regular expression.
The regular expression can be made self-documenting by writing it in free-spacing mode.
r = /
(.) # match any char and save to capture group 1
(?= # begin a positive lookahead
(.*\1) # match >= 0 characters followed by the contents of capture group 1
) # end the positive lookahead
/x # free-spacing regex definition mode
The following intermediate calculation is performed:
str.gsub(r).map { $1 + $2 }
#=> ["gntdyg", "ntdygtxrvxjn", "tdygtxrvxjnwksqhxuxt", "txrvxjnwksqhxuxt",
# "xrvxjnwksqhxux", "rvxjnwksqhxuxtr", "vxjnwksqhxuxtrv", "xjnwksqhxux",
# "xux"]
Notice that this does not enumerate all substrings beginning and ending with the same character (because .* is greedy). It does not generate, for example, the substring "xrvx".
Do not use a regular expression
v = str.each_char.with_index.with_object({}) do |(c,i),h|
if h.key?(c)
h[c][:size] = i - h[c][:start] + 1
else
h[c] = { start: i, size: 1 }
end
end.max_by { |_,h| h[:size] }.last
str[v[:start], v[:size]]
#=> "tdygtxrvxjnwksqhxuxt"
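For readers following along in Go, here is a rough sketch of the same non-regex idea. It is not a translation of the Ruby code line by line: it records the first index of each character and measures the span to every later occurrence. The function name longestSameEnds is made up for this example.

```go
package main

import "fmt"

// longestSameEnds (hypothetical name) returns the longest substring of s that
// begins and ends with the same character. On ties the earliest one wins.
func longestSameEnds(s string) string {
	runes := []rune(s)
	first := make(map[rune]int) // first index at which each rune was seen
	bestStart, bestLen := 0, 0
	for i, r := range runes {
		start, seen := first[r]
		if !seen {
			first[r] = i
			start = i
		}
		// Span from the first occurrence of r up to and including index i.
		if n := i - start + 1; n > bestLen {
			bestStart, bestLen = start, n
		}
	}
	return string(runes[bestStart : bestStart+bestLen])
}

func main() {
	fmt.Println(longestSameEnds("mgntdygtxrvxjnwksqhxuxtrv")) // tdygtxrvxjnwksqhxuxt
}
```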

skip regex chars until search using golang

This will skip the first 2 characters and start matching left to right:
re := regexp.MustCompile("(^.{2})(\\/path\\/subpath((\\/.*)|()))")
fmt.Println(re.MatchString("c:/path/subpath/path/subpath/")) // true
fmt.Println(re.MatchString("c:/patch/subpath/path/subpath/")) // false
Notice that the second one doesn't match, even though /path/subpath exists in the string. This is perfect.
Now, if I don't know how many characters to skip and want to start the search at the first '/', I tried this:
re2 := regexp.MustCompile("([^\\/])(\\/path\\/subpath((\\/.*)|()))")
fmt.Println(re2.MatchString("cddddd:/path/subpath/path/subpath")) // true
which is perfect. But if I change the first path:
fmt.Println(re2.MatchString("cddddd:/patch/subpath/path/subpath")) // this is true as well
I don't want the last one to match the second /path/subpath. I want to be able to search in the 1st group, start the second group from there and do a left to right match.
Any help would be greatly appreciated.
It pays to be more precise about what you want. State it in absolute terms, not like "second first should not match third". Instead, say:
I want to capture the path in the second group if it begins with /path/subpath. If the path contains /path/subpath somewhere later than the beginning, then I don't want that to match.
Also, slashes are not special in regex, so there is no need to escape them at all, let alone double-escape them.
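A quick way to convince yourself of the point about slashes is to compile both spellings and compare them. This is only an illustrative sketch; the variable names are made up here.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Interpreted string with escaped slashes, as in the question.
	escapedSlashRe = regexp.MustCompile("\\/path\\/subpath")
	// Raw string with plain slashes; it compiles to an equivalent pattern.
	plainSlashRe = regexp.MustCompile(`/path/subpath`)
)

func main() {
	for _, s := range []string{"c:/path/subpath/x", "c:/patch/subpath"} {
		// Both expressions report the same result for every input.
		fmt.Println(s, escapedSlashRe.MatchString(s), plainSlashRe.MatchString(s))
	}
}
```

Using a raw string (backticks) also removes the need to double the backslashes when a real escape such as \. is needed.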
The third expression does this:
capture everything that is not a slash from the start anchor
delimit group 1 from group 2 by :
require /path/subpath to be at the top of the path
capture whatever remains
This may be what you want:
package main
import (
"fmt"
"regexp"
)
func main() {
paths := []string{
"c:/path/subpath/path/subpath/",
"c:/patch/subpath/path/subpath/",
"cddddd:/path/subpath/path/subpath",
}
re1 := regexp.MustCompile("(^.{2})(/path/subpath(/.*))")
re2 := regexp.MustCompile("([^/])(/path/subpath((/.*)|()))")
re3 := regexp.MustCompile(`^([^/]+):/path/subpath(/.*)`)
for i, re := range []*regexp.Regexp{re1, re2, re3} {
i++
for _, s := range paths {
fmt.Println(i, re.MatchString(s), s)
if re.MatchString(s) {
matches := re.FindStringSubmatch(s)
for m, g := range matches {
m++
if m > 1 {
fmt.Printf("\n\t%d %v", m, g)
}
}
}
println()
}
println()
}
}
Output
$ go run so-regex-path.go
(...)
3 true c:/path/subpath/path/subpath/
2 c
3 /path/subpath/
3 false c:/patch/subpath/path/subpath/
3 true cddddd:/path/subpath/path/subpath
2 cddddd
3 /path/subpath

Positive lookahead + overlapping matches regex

I'm looking for a regex to match every % that is not followed by a valid 2-character hex code (2 characters in a-fA-F0-9). I came up with (%)(?=([0-9a-fA-F][^0-9a-fA-F]|[^0-9a-fA-F])), which works well but is not supported in golang because of the positive lookahead (?=).
How can I translate it (or maybe make it simpler?), so that it works with go?
For example, given the string %d%2524e%25f%255E00%%%252611%25, it should match the first % and the first two ones of the %%% substring.
ie: https://regex101.com/r/y0YQ1I/2
I only tried this on regex101 (marked golang regex), but it seems that it works as expected:
%[0-9a-fA-F][0-9a-fA-F]|(%)
or simpler:
%[0-9a-fA-F]{2}|(%)
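To make the idea concrete, here is a sketch of how that pattern might be used from Go. Valid %XX escapes are consumed by the first alternative, and only a lone % reaches the capturing group, so you keep the matches where group 1 took part. The helper name loosePercentIndices is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// A valid %XX escape is consumed by the first alternative; a lone % falls
// through to the capturing group in the second alternative.
var percentRe = regexp.MustCompile(`%[0-9a-fA-F]{2}|(%)`)

// loosePercentIndices (hypothetical name) returns the byte offsets of every %
// that is not followed by two hex digits.
func loosePercentIndices(s string) []int {
	var indices []int
	for _, m := range percentRe.FindAllStringSubmatchIndex(s, -1) {
		// m[2] is the start of group 1, or -1 when group 1 did not participate.
		if m[2] >= 0 {
			indices = append(indices, m[2])
		}
	}
	return indices
}

func main() {
	fmt.Println(loosePercentIndices("%d%2524e%25f%255E00%%%252611%25")) // [0 19 20]
}
```

Because a valid escape never contains a second %, consuming it cannot hide a lone % from a later match.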
The real challenge here is that the matches at position 19 and 20 are overlapping, which means we can't use any of the go builtin "FindAll..." functions since they only find non-overlapping matches. This means that we've got to match the regex repeatedly against substrings starting after subsequent match indices if we want to find them all.
For the regex itself I've used a non-capturing group (?:...) instead of a lookahead assertion. Additionally, the regex will also match percent-signs at the end of the string, since they cannot be followed by two hex digits:
func findPlainPercentIndices(s string) []int {
re := regexp.MustCompile(`%(?:[[:xdigit:]][[:^xdigit:]]|[[:^xdigit:]]|$)`)
indices := []int{}
idx := 0
for {
m := re.FindStringIndex(s[idx:])
if m == nil {
break
}
nextidx := idx + m[0]
indices = append(indices, nextidx)
idx = nextidx + 1
}
return indices
}
func main() {
str := "%d%2524e%25f%255E00%%%252611%25%%"
//      012345678901234567890123456789012
//      0         1         2         3
fmt.Printf("OK: %#v\n", findPlainPercentIndices(str))
// OK: []int{0, 19, 20, 31, 32}
}

Golang regexp to match multiple patterns between keyword pairs

I have a string which has two keywords, "CURRENT NAME(S)" and "NEW NAME(S)", and each of these keywords is followed by a bunch of words. I want to extract the set of words that follows each keyword. To elaborate with code:
s := `"CURRENT NAME(S)
Name1, Name2",,"NEW NAME(S)
NewName1,NewName2"`
re := regexp.MustCompile(`"CURRENT NAME(S).*",,"NEW NAME(S).*"`)
segs := re.FindAllString(s, -1)
fmt.Println("segs:", segs)
segs2 := re.FindAllStringSubmatch(s, -1)
fmt.Println("segs2:", segs2)
As you can see, the string s holds the input. "Name1, Name2" is the current names list and "NewName1,NewName2" is the new names list. I want to extract these two lists. The two lists are separated by a comma. Each keyword begins with a double quote, and its reach ends at the corresponding closing double quote.
What is the way to use regexp such that the program can print "Name1, Name2" and "NewName1,NewName2" ?
The issue with your regex is that the input string contains newline symbols, and . in Go regex does not match a newline. Another issue is that the .* is a greedy pattern and will match as many symbols as it can up to the last second keyword. Also, you need to escape parentheses in the regex pattern to match the ( and ) literal symbols.
The best way to solve the issue is to change .* into a negated character class pattern [^"]* and place it inside a pair of non-escaped ( and ) to form a capturing group (a construct to get submatches from the match).
Here is a Go demo:
package main
import (
"fmt"
"regexp"
)
func main() {
s := `"CURRENT NAME(S)
Name1, Name2",,"NEW NAME(S)
NewName1,NewName2"`
re := regexp.MustCompile(`"CURRENT NAME\(S\)\s*([^"]*)",,"NEW NAME\(S\)\s*([^"]*)"`)
segs2 := re.FindAllStringSubmatch(s,-1)
fmt.Printf("segs2: [%s; %s]", segs2[0][1], segs2[0][2])
}
Now, the regex matches:
"CURRENT NAME\(S\) - a literal string "CURRENT NAME(S)
\s* - zero or more whitespaces
([^"]*) - Group 1 capturing 0+ chars other than "
",,"NEW NAME\(S\) - a literal string ",,"NEW NAME(S)
\s* - zero or more whitespaces
([^"]*) - Group 2 capturing 0+ chars other than "
" - a literal "
If your input doesn't change then the simplest way would be to use submatches (groups). You can try something like this:
// (?s) is a flag that enables '.' to match newlines
var r = regexp.MustCompile(`(?s)CURRENT NAME\(S\)(.*)",,"NEW NAME\(S\)(.*)"`)
fmt.Println(r.MatchString(s))
m := r.FindSubmatch([]byte(s)) // FindSubmatch requires []byte
for i, match := range m { // i is the submatch index printed below
s := string(match)
fmt.Printf("Match - %d: %s\n", i, strings.Trim(s, "\n")) //remove the newline
}
Output (note that the first match is the entire text matched by the regex, as documented at https://golang.org/pkg/regexp/#Regexp.FindSubmatch):
Match - 0: CURRENT NAME(S)
Name1, Name2",,"NEW NAME(S)
NewName1,NewName2"
Match - 1: Name1, Name2
Match - 2: NewName1,NewName2
Example: https://play.golang.org/p/0cgBOMumtp
For a fixed format like in the example, you can also avoid regular expressions and perform explicit parsing as in this example - https://play.golang.org/p/QDIyYiWJHt:
package main
import (
"fmt"
"strings"
)
func main() {
s := `"CURRENT NAME(S)
Name1, Name2",,"NEW NAME(S)
NewName1,NewName2"`
names := []string{}
parts := strings.Split(s, ",,")
for _, part := range parts {
part = strings.Trim(part, `"`)
part = strings.TrimPrefix(part, "CURRENT NAME(S)")
part = strings.TrimPrefix(part, "NEW NAME(S)")
part = strings.TrimSpace(part)
names = append(names, part)
}
fmt.Println("Names:")
for _, name := range names {
fmt.Println(name)
}
}
Output:
Names:
Name1, Name2
NewName1,NewName2
It uses a few more lines of code but may make it easier to understand the processing logic at a first glance.

use regular expression to find and replace but only every 3 characters for DNA sequence

Is it possible to do a find/replace using regular expressions on a string of dna such that it only considers every 3 characters (a codon of dna) at a time.
for example I would like the regular expression to see this:
dna="AAACCCTTTGGG"
as this:
AAA CCC TTT GGG
If I use the regular expressions right now and the expression was
Regex.Replace(dna,"ACC","AAA") it would find a match, but in this case of looking at 3 characters at a time there would be no match.
Is this possible?
Why use a regex? Try this instead, which is probably more efficient to boot:
public string DnaReplaceCodon(string input, string match, string replace) {
if (match.Length != 3 || replace.Length != 3)
throw new ArgumentOutOfRangeException();
var output = new StringBuilder(input.Length);
int i = 0;
while (i + 2 < input.Length) {
if (input[i] == match[0] && input[i+1] == match[1] && input[i+2] == match[2]) {
output.Append(replace);
} else {
output.Append(input[i]);
output.Append(input[i+1]);
output.Append(input[i+2]);
}
i += 3;
}
// pick up trailing letters.
while (i < input.Length) output.Append(input[i++]);
return output.ToString();
}
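For comparison, the same codon-by-codon loop could look roughly like this in Go. This is only a sketch under the assumption that the sequence is plain ASCII (A, C, G, T). The function name replaceCodon is invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// replaceCodon (hypothetical helper) walks the sequence three characters at a
// time and replaces only codons that start on a codon boundary.
// It assumes single-byte (ASCII) input.
func replaceCodon(dna, match, replacement string) (string, error) {
	if len(match) != 3 || len(replacement) != 3 {
		return "", errors.New("match and replacement must be 3 characters long")
	}
	var out strings.Builder
	out.Grow(len(dna))
	i := 0
	for ; i+3 <= len(dna); i += 3 {
		if dna[i:i+3] == match {
			out.WriteString(replacement)
		} else {
			out.WriteString(dna[i : i+3])
		}
	}
	out.WriteString(dna[i:]) // trailing partial codon, if any
	return out.String(), nil
}

func main() {
	for _, codon := range []string{"ACC", "CCC"} {
		out, err := replaceCodon("AAACCCTTTGGG", codon, "AAA")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(codon, "->", out)
	}
}
```

With the input from the question, ACC leaves the sequence unchanged because it never starts on a codon boundary, while CCC is replaced.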
Solution
It is possible to do this with regex. Assuming the input is valid (contains only A, T, G, C):
Regex.Replace(input, @"\G((?:.{3})*?)" + codon, "$1" + replacement);
If the input is not guaranteed to be valid, you can just do a check with the regex ^[ATCG]*$ (allow non-multiple of 3) or ^([ATCG]{3})*$ (sequence must be multiple of 3). It doesn't make sense to operate on invalid input anyway.
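The two validation patterns do not depend on \G, so they also work in engines that lack it (Go's regexp package, for example, does not support \G). A small sketch in Go, with made-up variable names, shows how the two checks differ:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Any length, as long as only A, T, C and G appear.
	anyLengthRe = regexp.MustCompile(`^[ATCG]*$`)
	// Only whole codons: the length must be a multiple of 3.
	wholeCodonsRe = regexp.MustCompile(`^([ATCG]{3})*$`)
)

func main() {
	for _, s := range []string{"AAACCCTTTGGG", "AAACC", "AAAXCC"} {
		fmt.Println(s, anyLengthRe.MatchString(s), wholeCodonsRe.MatchString(s))
	}
	// AAACCCTTTGGG true true
	// AAACC true false
	// AAAXCC false false
}
```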
Explanation
The construction above works for any codon. For the sake of explanation, let the codon be AAA. The regex will be \G((?:.{3})*?)AAA.
The whole regex actually matches the shortest substring that ends with the codon to be replaced.
\G # Must be at beginning of the string, or where last match left off
((?:.{3})*?) # Match any number of codons, lazily. The text is also captured.
AAA # The codon we want to replace
We make sure the matches only start at positions whose index is a multiple of 3 with:
\G, which asserts that the match starts from where the previous match left off (or the beginning of the string)
the fact that the pattern ((?:.{3})*?)AAA can only match a sequence whose length is a multiple of 3.
Due to the lazy quantifier, we can be sure that in each match, the part before the codon to be replaced (matched by the ((?:.{3})*?) part) does not contain the codon.
In the replacement, we put back the part before the codon (which is captured in capturing group 1 and can be referred to with $1), followed by the replacement codon.
NOTE
As explained in the comment, the following is not a good solution! I leave it in so that others will not make the same mistake.
You can usually find out where a match starts and ends via m.start() and m.end(). If m.start() % 3 == 0 you found a relevant match.