Extract substring with regex? - crystal-lang

String.match returns a MatchData, but how can I get the matched string from MatchData?
puts "foo bar".match(/(foo)/)
output:
#<MatchData "foo" 1:"foo">
Sorry, I am new to Crystal.

You can access it via the well-known group indexes; just make sure to handle the nil (no match) case.
match = "foo bar".match(/foo (ba(r))/)
if match
# The full match
match[0] #=> "foo bar"
# The first capture group
match[1] #=> "bar"
# The second capture group
match[2] #=> "r"
end
You can find more information about MatchData in its API docs.

Related

Capture one suffix containing known substring when multiple matching prefixes (without known substring) found

Given an input of multiple strings, some containing the prefix is:, I need to capture one instance of the substring "Foo" or "Bar" following the is: prefix regardless of how many times is:Foo/is:Bar or is:Baz/is:Xyzzy appear.
Using the following regex: .*is:\b([Foo|Bar]*)\b.*
And using the following examples of test input lines with matches:
"is:Baz is:Foo FooBar" # Captures "Foo"
"is:Foo FooBar is:Bar" # Captures "Bar"
"is:Bar FooBar FooBaz Baz" # Captures "Bar"
"FooBar is:Bar FooBaz" # Captures "Bar"
"FooBar is:Xyzzy is:Foo" # Captures "Foo"
"is:Baz FooBar is:Foo" # Captures "Foo"
"FooBar is:Foo is:Xyzzy" # No capture
In the final line I also want to capture is:Foo, but the capture is thrown off by is:Xyzzy. This isn't an exhaustive list of possible test cases, but it illustrates the problem I'm coming up against.
You can write the pattern using a group, without the [ and ] that denote a character class.
You don't need the word boundary in :\b, as it is implied by the alternation (Foo|Bar) that follows.
You can add a word boundary before is: instead, giving \bis
.*\bis:(Foo|Bar)\b.*
See a regex101 demo.
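The corrected pattern can be checked against the sample lines with a short script. The question does not name a language, so Ruby is used here only for illustration, and the constant IS_TAG and the helper capture_is_tag are hypothetical names.

```ruby
# Hypothetical names for illustration; the pattern itself is the one from the answer.
IS_TAG = /.*\bis:(Foo|Bar)\b.*/

# Returns the captured "Foo" or "Bar", or nil when no is:Foo / is:Bar token exists.
def capture_is_tag(line)
  line[IS_TAG, 1]
end

samples = [
  "is:Baz is:Foo FooBar",
  "is:Foo FooBar is:Bar",
  "FooBar is:Foo is:Xyzzy",
  "is:Baz is:Xyzzy"
]

samples.each do |line|
  puts "#{line.inspect} -> #{capture_is_tag(line).inspect}"
end
```

Because the leading .* is greedy, the capture comes from the last is:Foo or is:Bar token on the line, so "FooBar is:Foo is:Xyzzy" now yields "Foo".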

How to match any pattern to any string using Ruby?

I would like to create a function in Ruby which accepts the following parameters:
A pattern string (e.g. "abab", "aabb", "aaaa", etc.)
An input string (e.g. "dogcatdogcat", "carcarhousehouse", etc.)
The return of the function should be "true" if the string matches the pattern and "false" if not.
My approach for the first step:
Use regex in order to separate the input string into an array of words (e.g. ["dog", "cat", "dog", "cat"]).
My regex expertise is not good enough to be able to find the right regex for this problem.
Does anyone know how to perform the appropriate regex so that recurring words get separated assuming the input string is always some form of pattern?
You can use capture groups and backreferences to match the same substring multiple times, e.g.:
abab = /\A(.+)(.+)\1\2\z/
aabb = /\A(.+)\1(.+)\2\z/
aaaa = /\A(.+)\1\1\1\z/
'dogcatdogcat'.match?(abab) #=> true
'dogcatdogcat'.match?(aabb) #=> false
'dogcatdogcat'.match?(aaaa) #=> false
'carcarhousehouse'.match?(abab) #=> false
'carcarhousehouse'.match?(aabb) #=> true
'carcarhousehouse'.match?(aaaa) #=> false
In the above pattern, (.+) defines a capture group that matches one or more characters. \1 then refers to the 1st capturing group and matches the same substring. (\2 is the 2nd group and so on)
\A and \z are anchors to match the beginning and end of the string.
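The same idea can be generalised so that the regex is built from the pattern string, which is what the question asks for. This is a minimal sketch: pattern_regex and matches_pattern? are hypothetical names, and it assumes each distinct letter of the pattern stands for one non-empty substring.

```ruby
# Builds a regex from a pattern string such as "abab".
# The first occurrence of a letter becomes a capture group;
# every later occurrence becomes a backreference to that group.
def pattern_regex(pattern)
  groups = {}
  body = pattern.each_char.map do |ch|
    if groups.key?(ch)
      "\\#{groups[ch]}"
    else
      groups[ch] = groups.size + 1
      "(.+)"
    end
  end.join
  Regexp.new("\\A#{body}\\z")
end

def matches_pattern?(pattern, input)
  pattern_regex(pattern).match?(input)
end

p pattern_regex("abab").source                   # the generated regex source
p matches_pattern?("abab", "dogcatdogcat")       # => true
p matches_pattern?("aabb", "dogcatdogcat")       # => false
p matches_pattern?("aabb", "carcarhousehouse")   # => true
```

Note that nothing here forces different letters to match different substrings, so "abab" also matches a string like "xxxx".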

How to validate the format of a string in Ruby, while extracting the matches?

What I want
validate that a string matches this format: /^(#\d\s*)+$/ (#1 #2 for instance).
Grab all the numbers with the hash, something like #<MatchData "1234" 1:"#1" 2:"#2">. It doesn't have to be a MatchData object; any type of array or enumerable would work.
My issue
When using match, it just matches the last occurrence:
/^(#\d\s*)+$/.match "#1 #2"
# => #<MatchData "#1 #2" 1:"#2">
When I use scan, it "works":
"#1 #2".scan /#\d/
# => ["#1", "#2"]
But I don't believe I can validate the format of the string this way, as it would return the same result for "aaa #1 #2".
The question
Can I, with only 1 method call, both validate that my string matches /^(#\d\s*)+$/ AND grab all the instances of #number?
I kinda feel bad about asking this since I've been using Ruby for a while now. It seems simple but I can't get it to work.
Yes, you may use
s.scan(/(?:\G(?!\A)|\A(?=(?:#\d\s*)*\z))\s*\K#\d/)
See the regex demo
Details
(?:\G(?!\A)|\A(?=(?:#\d\s*)*\z)) - two alternatives:
\G(?!\A) - the end of the previous successful match
| - or
\A(?=(?:#\d\s*)*\z) - start of string (\A) that is followed with 0 or more repetitions of # + digit + 0+ whitespaces and then followed with the end of string
\s* - 0+ whitespace chars
\K - match reset operator discarding the text matched so far
#\d - a # char and then a digit
In short: the start-of-string position is matched first, but only if the text to its right (i.e. the whole string) matches the pattern you want. Since that check is performed with a lookahead, the regex index stays where it was. After that, thanks to the \G operator, each subsequent match can begin only where the previous one ended. (\G matches either the start of the string or the end of the previous match, so (?!\A) is used to exclude the start-of-string position.)
Ruby demo:
rx = /(?:\G(?!\A)|\A(?=(?:#\d\s*)*\z))\s*\K#\d/
p "#1 #2".scan(rx)
# => ["#1", "#2"]
p "#1 NO #2".scan(rx)
# => []
def doit(str)
r = /\A#{"(#\\d)\\s*"*str.count('#')}\z/
str.match(r)&.captures
end
doit "#1#2 #3 " #=> ["#1", "#2", "#3"]
doit " #1#2 #3 " #=> nil
Notice the regular expressions depend only on the number of instances of the character '#' in the string. As that number is three in both examples the respective regular expressions are equal, namely:
/\A(#\d)\s*(#\d)\s*(#\d)\s*\z/
This regular expression was constructed as follows.
str = "#1#2 #3 "
n = str.count('#')
#=> 3
s = "(#\\d)\\s*"*n
#=> "(#\\d)\\s*(#\\d)\\s*(#\\d)\\s*"
/\A#{s}\z/
#=> /\A(#\d)\s*(#\d)\s*(#\d)\s*\z/
The regular expression reads, "match the beginning of the string followed by three identical capture groups, each optionally followed by whitespace, followed by the end of the string". The regular expression therefore both tests the validity of the string and extracts the desired matches in the capture groups.
The safe navigation operator, &., is needed in the event that there is no match (match returns nil).
A comment by the OP refers to a generalisation of the question in which the pound character ('#') is optional. That can be dealt with by modifying the regular expression as follows.
def doit(str)
r = /\A#{"(?:#?(\\d)(?=#|\\s+|\\z)\\s*)"*str.count('0123456789')}\z/
str.match(r)&.captures
end
doit "1 2 #3 " #=> ["1", "2", "3"]
doit "1#2" #=> ["1", "2"]
doit " #1 2 #3 " #=> nil
doit "#1 2# 3 " #=> nil
doit " #1 23 #3 " #=> nil
For strings containing three digits the regular expression is:
/\A(?:#?(\d)(?=#|\s+|\z)\s*)(?:#?(\d)(?=#|\s+|\z)\s*)(?:#?(\d)(?=#|\s+|\z)\s*)\z/
While it is true that this regular expression can potentially be quite long, that does not necessarily mean that it would be relatively inefficient, as the lookaheads are quite localized.

Regex to parse log data not capturing all groups [duplicate]

I'm testing this on regex101.com
Regex: ^\+([0-9A-Za-z-]+)(?:\.([0-9A-Za-z-]+))*$
Test string: +beta-bar.baz-bz.fd.zz
The string matches, but the "match information" box shows that there are only two capture groups:
MATCH 1
1. [1-9] `beta-bar`
2. [20-22] `zz`
I was expecting all these captures:
beta-bar
baz-bz
fd
zz
Why didn't each identifier between periods get recognized as its own captured group?
This happens because, when a quantifier is applied to a capture group and the group matches n times, only the last captured text is stored in the buffer and returned at the end.
Instead of matching those parts, you can preg_split the string you have with a simple regex [+.]:
$str = "+beta-bar.baz-bz.fd.zz";
$a = preg_split('/[+.]/', $str, -1, PREG_SPLIT_NO_EMPTY);
See IDEONE demo
Result:
Array
(
[0] => beta-bar
[1] => baz-bz
[2] => fd
[3] => zz
)
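For comparison with the Ruby examples elsewhere on this page, here is a small Ruby sketch of the same behaviour. It is not part of the original answer, which uses PHP. It shows the repeated group keeping only its last iteration, and a split on the delimiters returning every identifier.

```ruby
str = "+beta-bar.baz-bz.fd.zz"
pattern = /^\+([0-9A-Za-z-]+)(?:\.([0-9A-Za-z-]+))*$/

# The quantified second group is overwritten on each repetition,
# so only its final iteration survives in the match data.
captures = str.match(pattern).captures
p captures # => ["beta-bar", "zz"]

# Splitting on "+" and "." returns every identifier; the leading "+"
# produces one empty string, which is dropped here.
parts = str.split(/[+.]/).reject(&:empty?)
p parts # => ["beta-bar", "baz-bz", "fd", "zz"]
```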
