Why this regular expression in Golang is non greedy? [duplicate] - regex

Here is a simple regular expression:
package main

import (
	"fmt"
	"regexp"
)

const data = "abcdefghijklmn"

func main() {
	r, err := regexp.Compile(".{1,6}")
	if err != nil {
		panic(err)
	}
	for _, d := range r.FindAllIndex([]byte(data), -1) {
		fmt.Println(data[d[0]:d[1]])
	}
}
And we know it is greedy:
abcdef
ghijkl
mn
Now, we can add a ? after the expression to make it non greedy:
package main

import (
	"fmt"
	"regexp"
)

const data = "abcdefghijklmn"

func main() {
	r, err := regexp.Compile(".{1,6}?")
	if err != nil {
		panic(err)
	}
	for _, d := range r.FindAllIndex([]byte(data), -1) {
		fmt.Println(data[d[0]:d[1]])
	}
}
And we can get:
a
b
c
d
e
f
g
h
i
j
k
l
m
n
However, if we add other chars after the expression, it becomes greedy:
package main

import (
	"fmt"
	"regexp"
)

const data = "abcdefghijklmn"

func main() {
	r, err := regexp.Compile(".{1,6}?k")
	if err != nil {
		panic(err)
	}
	for _, d := range r.FindAllIndex([]byte(data), -1) {
		fmt.Println(data[d[0]:d[1]])
	}
}
And we get:
efghijk
So why does it become greedy if we add a character after it?

Adding ? after a repetition count makes the quantifier lazy: it changes from matching as many characters as possible to matching as few as possible.
However, this does not change the fact that the string must be processed serially. This is where your two cases differ:
.{1,6}? returns one character at a time because one character is the fewest the quantifier allows. The lazy quantifier lets the engine report a match after a single character, without needing to keep processing the string.
.{1,6}?k has to skip over abcd to get a match, but it then finds the substring starting at e to be a match. A lazy quantifier does not let the engine move to the next character in the string.
In short: matching from the current position takes precedence over moving to the next position in the hope of a smaller match.
As for your question about making it lazy again, you can't. You'll have to find a different regular expression for the output you want.
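The rule above can be checked with a small sketch. The helper firstMatch and the second input string "abkcdk" are made up for this illustration; they are not part of the question. The point is that laziness only changes where a match ends, never where it starts.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch returns the first (leftmost) match of pattern in input,
// or "" if there is none. It is a helper invented for this illustration.
func firstMatch(pattern, input string) string {
	return regexp.MustCompile(pattern).FindString(input)
}

func main() {
	// The earliest start from which a match can succeed is index 4 ("e"),
	// so the lazy quantifier still has to consume six characters.
	fmt.Println(firstMatch(`.{1,6}?k`, "abcdefghijklmn")) // efghijk

	// Same start position (index 0), different end: lazy stops at the
	// first k it can reach, greedy runs on to the last one within range.
	fmt.Println(firstMatch(`.{1,6}?k`, "abkcdk")) // abk
	fmt.Println(firstMatch(`.{1,6}k`, "abkcdk"))  // abkcdk
}
```

Both patterns in the second pair start matching at index 0; only the end of the match differs.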

Related

Czech characters in regexp search

I am trying to implement a very simple text matcher for Czech words. Since the Czech language is very suffix-heavy, I want to define the start of a word and then greedily match the rest of it. This is my implementation so far:
r := regexp.MustCompile("(?i)\\by\\w+\\b")
text := "x yž z"
matches := r.FindAllString(text, -1)
fmt.Println(matches) //have [], want [yž]
I studied Go's regexp syntax:
https://github.com/google/re2/wiki/Syntax
but I don't know how to define Czech characters there. \w matches only ASCII characters, not Czech UTF-8 characters.
Can you please help me?
In RE2, neither \w nor \b is Unicode-aware:
\b at ASCII word boundary («\w» on one side and «\W», «\A», or «\z» on the other)
\w word characters (== [0-9A-Za-z_])
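A quick way to see the difference is to run both character classes over the text from the question. This is only a sketch; findWords is a helper name invented here.

```go
package main

import (
	"fmt"
	"regexp"
)

// findWords returns every non-overlapping match of pattern in text.
// It is a helper invented for this illustration.
func findWords(pattern, text string) []string {
	return regexp.MustCompile(pattern).FindAllString(text, -1)
}

func main() {
	text := "x yž z"
	// \w is ASCII-only, so the match stops in front of ž.
	fmt.Println(findWords(`\w+`, text)) // [x y z]
	// \p{L} matches any Unicode letter, including ž.
	fmt.Println(findWords(`\p{L}+`, text)) // [x yž z]
}
```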
A more general approach is to split on any run of one or more non-letter characters and then keep only the items that meet your criteria:
package main

import (
	"fmt"
	"regexp"
	"strings"
)

func main() {
	output := []string{}
	r := regexp.MustCompile(`\P{L}+`) // one or more non-letter characters
	str := "x--++yž,,,.z..00"
	words := r.Split(str, -1)
	for i := range words {
		// Keep non-empty chunks that start with y or Y.
		if len(words[i]) > 0 && (strings.HasPrefix(words[i], `y`) || strings.HasPrefix(words[i], `Y`)) {
			output = append(output, words[i])
		}
	}
	fmt.Println(output)
}
See the Go demo.
Note that a naive approach like
package main

import (
	"fmt"
	"regexp"
)

func main() {
	output := []string{}
	r := regexp.MustCompile(`(?i)(?:\P{L}|^)(y\p{L}*)(?:\P{L}|$)`)
	str := "x--++yž,,,.z..00..."
	matches := r.FindAllStringSubmatch(str, -1)
	for _, v := range matches {
		output = append(output, v[1])
	}
	fmt.Println(output)
}
won't work when the string contains consecutive matches such as match1,match2 match3. It will only fetch the odd occurrences, because the last non-capturing group consumes the character that the first non-capturing group needs in order to match on the next attempt.
A workaround for the above code is to append an extra non-letter character to each run of non-letter characters, for example:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	output := []string{}
	r := regexp.MustCompile(`(?i)(?:\P{L}|^)(u\p{L}*)(?:\P{L}|$)`)
	str := "uhličitá,uhličité,uhličitou,uhličitého,yz,my"
	matches := r.FindAllStringSubmatch(regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `), -1)
	for _, v := range matches {
		output = append(output, v[1])
	}
	fmt.Println(output)
}
// => [uhličitá uhličité uhličitou uhličitého]
See this Go demo.
Here, regexp.MustCompile(`\P{L}+`).ReplaceAllString(str, `$0 `) adds a space after all chunks of non-letter chars.
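To make the skipped-match problem and the padding workaround concrete, here is a minimal side-by-side sketch. The input "ya,yb,yc" and the helper names captures and padded are assumptions chosen for the demonstration; the pattern is the y-prefixed one from the naive example.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	wordRe    = regexp.MustCompile(`(?i)(?:\P{L}|^)(y\p{L}*)(?:\P{L}|$)`)
	nonLetter = regexp.MustCompile(`\P{L}+`)
)

// captures collects capture group 1 from every match of wordRe in s.
func captures(s string) []string {
	var out []string
	for _, m := range wordRe.FindAllStringSubmatch(s, -1) {
		out = append(out, m[1])
	}
	return out
}

// padded appends a space to every run of non-letters before matching,
// so each word gets its own leading and trailing delimiter.
func padded(s string) []string {
	return captures(nonLetter.ReplaceAllString(s, "${0} "))
}

func main() {
	s := "ya,yb,yc"
	// The comma after "ya" is consumed by the first match, so "yb" has
	// no leading delimiter left and is skipped.
	fmt.Println(captures(s)) // [ya yc]
	fmt.Println(padded(s))   // [ya yb yc]
}
```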

Issues with regex. nil slice and FindStringSubmatch

I'm trying to regex a pattern:
random-text (800)
I'm doing something like this:
func main() {
	rando := "random-text (800)"
	parsedThing := regexp.MustCompile(`\((.*?)\)`)
	match := parsedThing.FindStringSubmatch(rando)
	if match[1] == "" {
		fmt.Println("do a thing")
	}
	if match[1] != "" {
		fmt.Println("do a thing")
	}
}
I only want to capture what's in the parentheses, but FindString returns the parentheses as well. I've also tried FindStringSubmatch, which is great because I can pick the capture group out of the slice, but then my unit test fails because the slice is nil. I need to test for an empty string, as that's a thing that could happen. Is there a better regex that will only capture what's inside the parentheses? Or is there a better way to handle a nil slice?
I usually compare against nil, based on the documentation:
A return value of nil indicates no match.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`\((.+)\)`)
	find := re.FindStringSubmatch("random-text (800)")
	if find != nil {
		fmt.Println(find[1] == "800")
	}
}
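Since the question also needs to tell "empty parentheses" apart from "no parentheses at all", one possible shape is a helper that returns the captured text together with a found flag. This is a sketch, not the only way to do it: insideParens is a hypothetical name, and it uses the question's `\((.*?)\)` pattern so that an empty pair still matches.

```go
package main

import (
	"fmt"
	"regexp"
)

var parenRe = regexp.MustCompile(`\((.*?)\)`)

// insideParens returns the text inside the first pair of parentheses.
// ok is false when s contains no parenthesised group at all.
func insideParens(s string) (inner string, ok bool) {
	m := parenRe.FindStringSubmatch(s)
	if m == nil { // nil means no match, so m[1] must not be touched
		return "", false
	}
	return m[1], true
}

func main() {
	for _, s := range []string{"random-text (800)", "random-text ()", "random-text"} {
		inner, ok := insideParens(s)
		fmt.Printf("%q -> %q, %t\n", s, inner, ok)
	}
}
```

With this shape a unit test can assert on the flag for the no-match case and on the string for the empty case, without ever indexing a nil slice.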

Find all strings in between two strings in Go

I am working on extracting multiple matches between two strings.
In the example below, I am trying to regex out an A B C substring out of my string.
Here is my code:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	str := "Movies: A B C Food: 1 2 3"
	re := regexp.MustCompile(`[Movies:][^Food:]*`)
	match := re.FindAllString(str, -1)
	fmt.Println(match)
}
I am clearly doing something wrong in my regex. I am trying to get the A B C string between Movies: and Food:.
What is the proper regex to get all strings between two strings?
In Go, the RE2-based regexp package does not support lookarounds, so you need to use a capturing group with the regexp.FindAllStringSubmatch function:
left := "LEFT_DELIMITER_TEXT_HERE"
right := "RIGHT_DELIMITER_TEXT_HERE"
rx := regexp.MustCompile(`(?s)` + regexp.QuoteMeta(left) + `(.*?)` + regexp.QuoteMeta(right))
matches := rx.FindAllStringSubmatch(str, -1)
Note the use of regexp.QuoteMeta that automatically escapes all special regex metacharacters in the left- and right-hand delimiters.
The (?s) makes . match across lines, and (.*?) captures everything between the left and right delimiters into Group 1.
So, here you can use
package main

import (
	"fmt"
	"regexp"
)

func main() {
	str := "Movies: A B C Food: 1 2 3"
	r := regexp.MustCompile(`Movies:\s*(.*?)\s*Food`)
	matches := r.FindAllStringSubmatch(str, -1)
	for _, v := range matches {
		fmt.Println(v[1])
	}
}
See the Go demo. Output: A B C.
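The generic delimiter recipe can also be wrapped in a small reusable function. The sketch below is one way to do it under a few assumptions: the name between is invented here, the result is trimmed with strings.TrimSpace (the generic recipe itself does not trim), and the second input is made up to show why regexp.QuoteMeta matters.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// between returns the trimmed text found between each left/right pair in s.
func between(s, left, right string) []string {
	rx := regexp.MustCompile(`(?s)` + regexp.QuoteMeta(left) + `(.*?)` + regexp.QuoteMeta(right))
	var out []string
	for _, m := range rx.FindAllStringSubmatch(s, -1) {
		out = append(out, strings.TrimSpace(m[1]))
	}
	return out
}

func main() {
	fmt.Println(between("Movies: A B C Food: 1 2 3", "Movies:", "Food:")) // [A B C]
	// Parentheses are regex metacharacters; QuoteMeta makes them literal.
	fmt.Println(between("f(x) + g(y)", "(", ")")) // [x y]
}
```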

IPv4 regexp capturing the incorrect parts of the address [duplicate]

I'm trying to write a program that prints the invalid part or parts of an IPv4 address from terminal input.
Here is my code:
package chapter4

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strings"
	"time"
)

func IPV4() {
	var f *os.File
	f = os.Stdin
	defer f.Close()
	scanner := bufio.NewScanner(f)
	fmt.Println("Exercise 1, Chapter 4 - Detecting incorrect parts of IPv4 Addresses, enter an address!")
	for scanner.Scan() {
		if scanner.Text() == "STOP" {
			fmt.Println("Initializing Level 4...")
			time.Sleep(5 * time.Second)
			break
		}
		expression := "(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
		matchMe, err := regexp.Compile(expression)
		if err != nil {
			fmt.Println("Could not compile!", err)
		}
		s := strings.Split(scanner.Text(), ".")
		for _, value := range s {
			fmt.Println(value)
			str := matchMe.FindString(value)
			if len(str) == 0 {
				fmt.Println(value)
			}
		}
	}
}
My thought process is that for every terminal IP address input, I split the string by '.'
Then I iterate over the resulting []string and match each value to the regular expression.
For some reason the only case where the regex expression doesn't match is when there are letter characters in the input. Every number, no matter the size or composition, is a valid match for my expression.
I'm hoping you can help me identify the problem, and if there's a better way to do it, I'm all ears. Thanks!
This expression might be closer to what you have in mind:
^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$
Test
package main

import (
	"fmt"
	"regexp"
)

func main() {
	var re = regexp.MustCompile(`(?m)^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$`)
	var str = `127.0.0.1
192.168.1.1
192.168.1.255
255.255.255.255
0.0.0.0
1.1.1.01
30.168.1.255.1
127.1
192.168.1.256
-1.2.3.4
3...3`
	for i, match := range re.FindAllString(str, -1) {
		fmt.Println(match, "found at index", i)
	}
}
The expression is explained on the top right panel of regex101.com, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs, if you like.
Reference:
Validating IPv4 addresses with regexp
I am pretty sure that your expression needs anchors or the last part of it will match any single digit and succeed. Try using ^ on the front and $ on the back.
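Applied to the per-part approach from the question, the anchoring advice could look like the sketch below. It keeps the question's alternation unchanged and only wraps it in ^(?: ... )$. The names octetRe and invalidParts are invented for the illustration, and the function deliberately does not check that there are exactly four parts.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// octetRe is the question's expression wrapped in ^(?: ... )$ so that it
// has to cover the whole part rather than any substring of it.
var octetRe = regexp.MustCompile(`^(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])$`)

// invalidParts returns the dot-separated parts of addr that are not a
// decimal number in the range 0-255. It does not check the part count.
func invalidParts(addr string) []string {
	var bad []string
	for _, part := range strings.Split(addr, ".") {
		if !octetRe.MatchString(part) {
			bad = append(bad, part)
		}
	}
	return bad
}

func main() {
	fmt.Println(invalidParts("192.168.1.1"))   // []
	fmt.Println(invalidParts("192.168.1.256")) // [256]
	fmt.Println(invalidParts("300.abc.1.999")) // [300 abc 999]
}
```

Without the anchors, FindString("999") returns "99", which is why every number looked valid in the original program.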

Regex extracting sets of numbers from string when prefix occurs, while not matching said prefix

As stated in the title, given a situation where I have a string like so:
"somestring~200~122"
I want a regex that matches the numbers only when they directly follow the prefix "~", so that I ultimately end up with [200, 122].
Matching the prefix is necessary as I need to protect against a case where a string like the one below should not be matched
"somestring~abc200~def122"
For additional context: I am using Go, so I am planning on doing something like the following to obtain the numbers within the string:
pattern := regexp.MustCompile("regex i need help with")
numbers := pattern.FindAllString(host, -1)
You can use FindAllStringSubmatch to extract the group containing just the digits. Below is an example that finds all instances of ~ followed by numbers. It additionally converts all the matches to ints and inserts them into a slice:
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

func main() {
	host := "somestring~200~122"
	pattern := regexp.MustCompile(`~(\d+)`)
	numberStrings := pattern.FindAllStringSubmatch(host, -1)
	numbers := make([]int, len(numberStrings))
	for i, numberString := range numberStrings {
		number, err := strconv.Atoi(numberString[1])
		if err != nil {
			panic(err)
		}
		numbers[i] = number
	}
	fmt.Println(numbers)
}
https://play.golang.org/p/09YyewtRXz
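To confirm that the same pattern also rejects the second string from the question, the logic can be pulled into a function and run on both inputs. This is a sketch only: tildeNumbers is a hypothetical name, and it returns the conversion error instead of panicking.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var tildeNumRe = regexp.MustCompile(`~(\d+)`)

// tildeNumbers returns every run of digits that directly follows a "~".
func tildeNumbers(host string) ([]int, error) {
	var nums []int
	for _, m := range tildeNumRe.FindAllStringSubmatch(host, -1) {
		n, err := strconv.Atoi(m[1])
		if err != nil {
			return nil, err // e.g. a digit run too large for int
		}
		nums = append(nums, n)
	}
	return nums, nil
}

func main() {
	fmt.Println(tildeNumbers("somestring~200~122"))       // [200 122] <nil>
	fmt.Println(tildeNumbers("somestring~abc200~def122")) // [] <nil>
}
```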