Ruby (2.6.0) Regex multiple matches [duplicate] - regex

Is there a quick way to find every match of a regular expression in Ruby? I've looked through the Regex object in the Ruby STL and searched on Google to no avail.

Using scan should do the trick:
string.scan(/regex/)

To find all the matching strings, use String's scan method.
str = "A 54mpl3 string w1th 7 numb3rs scatter36 ar0und"
str.scan(/\d+/)
#=> ["54", "3", "1", "7", "3", "36", "0"]
If you want MatchData objects, which is what the Regexp match method returns, use:
str.to_enum(:scan, /\d+/).map { Regexp.last_match }
#=> [#<MatchData "54">, #<MatchData "3">, #<MatchData "1">, #<MatchData "7">, #<MatchData "3">, #<MatchData "36">, #<MatchData "0">]
The benefit of using MatchData is that you can use methods like offset:
match_datas = str.to_enum(:scan, /\d+/).map { Regexp.last_match }
match_datas[0].offset(0)
#=> [2, 4]
match_datas[1].offset(0)
#=> [7, 8]
See these questions if you'd like to know more:
"How do I get the match data for all occurrences of a Ruby regular expression in a string?"
"Ruby regular expression matching enumerator with named capture support"
"How to find out the starting point for each match in ruby"
Reading about special variables $&, $', $1, $2 in Ruby will be helpful too.
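As a rough illustration of those special variables, here is a minimal sketch that reads them right after a successful match. The helper name describe_match and the sample string are made up for this example; only the variables themselves ($&, $`, $', $1, $~) are standard Ruby.

```ruby
# Minimal sketch: read the special variables right after a successful match.
# describe_match is a hypothetical helper, not part of Ruby.
def describe_match(str, re)
  return nil unless str =~ re

  whole  = $&   # text of the whole match, same as $~[0]
  before = $`   # text before the match
  after  = $'   # text after the match
  first  = $1   # first capture group, or nil if the pattern has none
  { whole: whole, before: before, after: after, first: first, offset: $~.offset(0) }
end

p describe_match("w1th 7 numb3rs", /(\d)t/)
```

These variables are local to the current method and thread, so they should be read immediately after the match that set them.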

If you have a regexp with groups:
str = "A 54mpl3 string w1th 7 numbers scatter3r ar0und"
re = /(\d+)[m-t]/
you can use String's scan method to find the matching groups:
str.scan(re)
#=> [["54"], ["1"], ["3"]]
To find the whole matched text instead:
str.to_enum(:scan, re).map { $& }
#=> ["54m", "1t", "3r"]

You can use string.scan(your_regex).flatten. If your regex contains groups, it will return in a single plain array.
string = "A 54mpl3 string w1th 7 numbers scatter3r ar0und"
your_regex = /(\d+)[m-t]/
string.scan(your_regex).flatten
=> ["54", "1", "3"]
The regex can use named groups as well:
string = 'group_photo.jpg'
regex = /\A(?<name>.*)\.(?<ext>.*)\z/
string.scan(regex).flatten
=> ["group_photo", "jpg"]
You can also use gsub without a block, which returns an enumerator; it's just one more way to get MatchData:
str.gsub(/\d/).map { Regexp.last_match }
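When the pattern uses named groups, flattening the scan result throws the names away. The sketch below keeps them by combining the enumerator trick with MatchData#named_captures (available since Ruby 2.4). The helper name named_matches and the sample input are assumptions made for illustration.

```ruby
# Sketch: return one hash of named captures per match.
# named_matches is a hypothetical helper name.
def named_matches(str, re)
  # scan sets Regexp.last_match before each block call, so the map block
  # can read the MatchData for the current match.
  str.to_enum(:scan, re).map { Regexp.last_match.named_captures }
end

pairs = named_matches("a=1, b=2", /(?<key>\w)=(?<val>\d)/)
pairs.each { |h| puts "#{h['key']} -> #{h['val']}" }
```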

If you have capture groups () inside the regex for other purposes, the proposed solutions with String#scan and String#match are problematic:
String#scan only gets what is inside the capture groups;
String#match only gets the first match, rejecting all the others;
String#matches (proposed function) gets all the matches.
In this case, we need a solution that matches the regex without considering the capture groups.
String#matches
With refinements you can monkey patch the String class and implement String#matches; the method is then available only inside the scope that activates the refinement with using. It is an incredible way to monkey patch classes in Ruby.
Setup
/lib/refinements/string_matches.rb
# This module adds a String refinement to enable multiple String#match()s
# 1. `String#scan` only gets what is inside the capture groups (inside the parens)
# 2. `String#match` only gets the first match
# 3. `String#matches` (proposed function) gets all the matches
module StringMatches
  refine String do
    def matches(regex)
      # Wrapping the pattern in a named group makes scan return whole matches
      scan(/(?<matching>#{regex})/).flatten
    end
  end
end
Used: named capture groups
Usage
rails c
> require 'refinements/string_matches'
> using StringMatches
> 'function(1, 2, 3) + function(4, 5, 6)'.matches(/function\((\d), (\d), (\d)\)/)
=> ["function(1, 2, 3)", "function(4, 5, 6)"]
> 'function(1, 2, 3) + function(4, 5, 6)'.scan(/function\((\d), (\d), (\d)\)/)
=> [["1", "2", "3"], ["4", "5", "6"]]
> 'function(1, 2, 3) + function(4, 5, 6)'.match(/function\((\d), (\d), (\d)\)/)[0]
=> "function(1, 2, 3)"

Return an array of MatchData objects
#scan is very limited--only returns a simple array of strings!
Far more powerful/flexible for us to get an array of MatchData objects.
I'll provide two approaches (using same logic), one using a PORO and one using a monkey patch:
PORO:
class MatchAll
  def initialize(string, pattern)
    raise ArgumentError, 'must pass a String' unless string.is_a?(String)
    raise ArgumentError, 'must pass a Regexp pattern' unless pattern.is_a?(Regexp)
    @string = string
    @pattern = pattern
    @matches = []
  end
  def match_all
    recursive_match
  end
  private
  # Collects each MatchData, restarting the search where the previous match ended
  def recursive_match(prev_match = nil)
    index = prev_match.nil? ? 0 : prev_match.offset(0)[1]
    matching_item = @string.match(@pattern, index)
    # present? comes from ActiveSupport (Rails)
    return @matches unless matching_item.present?
    @matches << matching_item
    recursive_match(matching_item)
  end
end
USAGE:
test_string = 'a green frog jumped on a green lilypad'
MatchAll.new(test_string, /green/).match_all
=> [#<MatchData "green">, #<MatchData "green">]
Monkey patch
I don't typically condone monkey-patching, but in this case:
we're doing it the right way by "quarantining" our patch into its own module
I prefer this approach because 'string'.match_all(/pattern/) is more intuitive (and looks a lot nicer) than MatchAll.new('string', /pattern/).match_all
module RubyCoreExtensions
  module String
    module MatchAll
      def match_all(pattern)
        raise ArgumentError, 'must pass a Regexp pattern' unless pattern.is_a?(Regexp)
        recursive_match(pattern)
      end
      private
      def recursive_match(pattern, matches = [], prev_match = nil)
        index = prev_match.nil? ? 0 : prev_match.offset(0)[1]
        matching_item = self.match(pattern, index)
        return matches unless matching_item.present?
        matches << matching_item
        recursive_match(pattern, matches, matching_item)
      end
    end
  end
end
I recommend putting the patch in its own file; assuming you're using Rails, /lib/ruby_core_extensions/string/match_all.rb is a good place.
To use our patch we need to make it available:
# within application.rb
require './lib/ruby_core_extensions/string/match_all.rb'
Then be sure to include it in the String class. You could put this wherever you want, for example right under the require statement we just wrote above. After you include it once, it will be available everywhere, even outside the class where you included it.
String.include RubyCoreExtensions::String::MatchAll
USAGE: And now when you use #match_all you get results like:
test_string = 'hello foo, what foo are you going to foo today?'
test_string.match_all /foo/
=> [#<MatchData "foo">, #<MatchData "foo">, #<MatchData "foo">]
test_string.match_all /hello/
=> [#<MatchData "hello">]
test_string.match_all /none/
=> []
I find this particularly useful when I want to match multiple occurrences and then get useful information about each one, such as where it starts and ends (e.g. match.offset(0) => [start_index, end_index], where end_index is one past the last matched character).
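Both versions above rely on ActiveSupport's present? and on recursion, and they would not advance past a pattern that can match the empty string. A possible plain-Ruby variant is sketched below. It is an alternative written for illustration, not the author's code, and the top-level method name match_all is an assumption.

```ruby
# Sketch: collect MatchData for every match using a loop instead of recursion.
# Works without Rails; match_all here is a hypothetical top-level helper.
def match_all(string, pattern)
  matches = []
  pos = 0
  while pos <= string.length && (m = string.match(pattern, pos))
    matches << m
    # Resume after this match; step one extra character past an empty
    # match so the loop always advances.
    pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
  matches
end

match_all('a green frog jumped on a green lilypad', /green/).each do |m|
  puts "#{m[0]} at #{m.offset(0).inspect}"
end
```

The loop also avoids deep call stacks on strings with a very large number of matches.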

Related

Regex- to get part of String

I have the string below and I need to get all the values between Pizzahut: and |.
ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg|
I have the regular expression .scan(/(?<=Pizzahut:)([.*\s\S]+)(?=\|)/), but it fetches
"j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg|"
Result should be: 34532jdhgj,3242237,67688873rg
You can use
s='ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg|'
p s.scan(/Pizzahut:([^|]+)/).flatten
# => ["j34532jdhgj", "3242237", "67688873rg"]
See this Ruby demo and the Rubular demo.
It seems unlikely that Pizzahut would appear as part of another word, but if that is possible, use a version with a word boundary, /\bPizzahut:([^|]+)/.
The Pizzahut:([^|]+) matches Pizzahut: and then captures into Group 1 any one or more chars other than a pipe (with ([^|]+)).
Note that when a pattern contains a capturing group, String#scan returns only the captures rather than the whole match, so you do not need to use lookarounds.
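To make that difference concrete, here is a small sketch comparing the two styles on a shortened input. The method names and the sample string are made up for the example.

```ruby
# Sketch: how String#scan shapes its result with and without a capture group.
# Both helper names are hypothetical.
def values_with_group(s)
  # One capture group: scan returns an array of one-element arrays.
  s.scan(/Pizzahut:([^|]+)/)
end

def values_with_lookbehind(s)
  # No capture group: scan returns the matched text itself.
  s.scan(/(?<=Pizzahut:)[^|]+/)
end

sample = 'ABC:1|Pizzahut:x1|Dominos:2|Pizzahut:y2|'
p values_with_group(sample)
p values_with_lookbehind(sample)
```

The first call yields nested arrays (hence the flatten in the answer above), while the second yields a flat array of strings.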
I'm not sure why you're jumping to a regex solution here; that input string clearly looks structured to me, and you would probably do better by splitting it on the delimiters to convert it into a more convenient data structure.
Something like this:
input = "ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|Pizzahut:3242237|Wendys:3462783|Pizzahut:67688873rg"
converted_input = input
  .split('|') #=> ["ABC:2fg45rdvsg", "Pizzahut:j34532jdhgj", ... ]
  .map { |pair| pair.split(':') } #=> [["ABC", "2fg45rdvsg"], ["Pizzahut", "j34532jdhgj"], ... ]
  .group_by(&:first) #=> {"ABC"=>[["ABC", "2fg45rdvsg"]], "Pizzahut"=>[["Pizzahut", "j34532jdhgj"], ... ], "Dominos"=>[["Dominos", "3424232"]], ... }
  .transform_values { |v| v.flat_map(&:last) }
(The above series of transformations is just one possible way; you could probably come up with a dozen similar alternative steps to convert this input into the same hash shown below! For example, by using reduce or even the CSV library.)
Which gives you the final result:
converted_input = {
  "ABC" => ["2fg45rdvsg"],
  "Pizzahut" => ["j34532jdhgj", "3242237", "67688873rg"],
  "Dominos" => ["3424232"],
  "Wendys" => ["3462783"]
}
Now that the data is formatted conveniently, obtaining data like your original request becomes trivial:
converted_input["Pizzahut"].join(',') #=> "j34532jdhgj,3242237,67688873rg"
(Although quite likely it would be more suitable to leave it as an Array, not a comma-separated String!!)
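As one of the alternative routes mentioned above, the same hash can be built in a single pass with each_with_object. This is only a sketch; the method name group_pairs is an assumption, and it presumes every segment has the form name:value.

```ruby
# Sketch: build { "name" => [values...] } in one pass over the segments.
# group_pairs is a hypothetical helper name.
def group_pairs(input)
  input.split('|').each_with_object(Hash.new { |h, k| h[k] = [] }) do |pair, acc|
    # Limit the split to 2 parts so a value containing ':' stays intact.
    key, value = pair.split(':', 2)
    acc[key] << value
  end
end

grouped = group_pairs('ABC:2fg45rdvsg|Pizzahut:j34532jdhgj|Dominos:3424232|Pizzahut:3242237')
puts grouped['Pizzahut'].join(',')
```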

Entire text is matched but not able to group in named groups

I have the following example text:
my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10
I want to extract sub-fields from it in following way:
appName = my_app,
[
{key = key1, value = value1},
{key = user_id, value = testuser},
{key = ip_address, value = 10.10.10.10}
]
I have written the following regex to do this:
(?<appName>\w+)\|(((?<key>\w+)?(?<equals>=)(?<value>[^\|]+))\|?)+
It matches the entire text but is not able to group it correctly in named groups.
Tried testing it on https://regex101.com/
What am I missing here?
I think the main problem is that you are trying to write a regex that matches ALL the key=value pairs. That's not the way to do it. The correct way is to write a pattern that matches ONLY ONE key=value pair and apply it with a function that finds all occurrences of the pattern. Most languages supply such a function. Here's the code in Python, for example:
import re
txt = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
pairs = re.findall(r'(\w+)=([^|]+)', txt)
print(pairs)
This gives:
[('key1', 'value1'), ('user_id', 'testuser'), ('ip_address', '10.10.10.10')]
The pattern matches a key consisting of alphanumeric characters, (\w+), followed by = and a value. The value is matched by ([^|]+), that is, everything except a vertical bar, because the value can contain non-alphanumeric characters, such as the dots in the IP address.
Mind the findall function. There's a search function to catch a pattern once, and there's a findall function to catch all the patterns within the text.
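For readers following the Ruby part of this page, the closest counterpart to findall is String#scan. The sketch below applies the same single-pair pattern; the method name parse_pairs is an assumption made for the example.

```ruby
# Sketch: Ruby counterpart of the findall approach, using String#scan.
# parse_pairs is a hypothetical helper name.
def parse_pairs(txt)
  # Two capture groups, so scan returns one [key, value] array per match.
  txt.scan(/(\w+)=([^|]+)/)
end

pairs = parse_pairs('my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10')
pairs.each { |key, value| puts "#{key}: #{value}" }
```

Calling to_h on the result turns the pairs into a hash when the keys are known to be unique.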
I tested it on regex101 and it worked.
I must comment, though, that the specific text pattern you are working with doesn't require a regex. All high-level languages supply a split function. You can split on the vertical bar, and then split each slice you get (except the first one) again on the equals sign.
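The split-based idea described above might look like this in Ruby (used here for consistency with the rest of the page). The method name parse_record and the shape of the returned structure are assumptions.

```ruby
# Sketch: parse "app|k=v|k=v" with split only, no regex.
# parse_record is a hypothetical helper name.
def parse_record(line)
  # The first slice is the app name; the remaining slices are key=value fields.
  app_name, *fields = line.split('|')
  pairs = fields.map do |field|
    key, value = field.split('=', 2) # limit 2 keeps any '=' inside the value
    { key: key, value: value }
  end
  [app_name, pairs]
end

app, pairs = parse_record('my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10')
puts app
pairs.each { |h| puts "#{h[:key]} = #{h[:value]}" }
```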
Use the PyPI regex module with the following code:
import regex
s = "my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10"
rx = r"(?<appName>\w+)(?:\|(?<key>\w+)=(?<value>[^|]+))+"
print( [(m.group("appName"), dict(zip(m.captures("key"),m.captures("value")))) for m in regex.finditer(rx, s)] )
# => [('my_app', {'ip_address': '10.10.10.10', 'key1': 'value1', 'user_id': 'testuser'})]
See the Python demo online.
The captures method returns all the values a group captured across every iteration of the repeated group.
Not sure, but maybe regular expression might be unnecessary, and splitting similar to,
data = 'my_app|key1=value1|user_id=testuser|ip_address=10.10.10.10'
x = data.split('|')
appName = []
for index, item in enumerate(x):
    if index > 0:
        element = item.split('=')
        temp = {"key": element[0], "value": element[1]}
        appName.append(temp)
appName = str(x[0] + ',' + str(appName))
print(appName)
might return an output similar to the desired output:
my_app,[{'key': 'key1', 'value': 'value1'}, {'key': 'user_id', 'value': 'testuser'}, {'key': 'ip_address', 'value': '10.10.10.10'}]
using dict:
temp = {"key":element[0],"value":element[1]}
temp can be modified to other desired data structure that you like to have.

How to build a case-insensitive regular expression with Regexp.union

I have a list of strings, and need to build the regular expression from them, using Regexp#union. I need the resulting pattern to be case insensitive.
The #union method itself does not accept options/modifiers, hence I currently see two options:
strings = %w|one two three|
Regexp.new(Regexp.union(strings).to_s, true)
and/or:
Regexp.union(*strings.map { |s| /#{s}/i })
Both variants look a bit weird.
Is there an ability to construct a case-insensitive regular expression by using Regexp.union?
The simple starting place is:
words = %w[one two three]
/#{ Regexp.union(words).source }/i # => /one|two|three/i
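You probably want to make sure you're only matching whole words, so tweak it to:
/\b#{ Regexp.union(words).source }\b/i # => /\bone|two|three\b/i
Note that in this pattern the first \b applies only to one and the last only to three, because alternation has the lowest precedence. Wrapping the alternatives in a non-capturing group fixes that, and is cleaner and clearer: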
/\b(?:#{ Regexp.union(words).source })\b/i # => /\b(?:one|two|three)\b/i
Using source is important. When you create a Regexp object, it has an idea of the flags (i, m, x) that apply to that object and those get interpolated into the string:
"#{ /foo/i }" # => "(?i-mx:foo)"
"#{ /foo/ix }" # => "(?ix-m:foo)"
"#{ /foo/ixm }" # => "(?mix:foo)"
or
(/foo/i).to_s # => "(?i-mx:foo)"
(/foo/ix).to_s # => "(?ix-m:foo)"
(/foo/ixm).to_s # => "(?mix:foo)"
That's fine when the generated pattern stands alone, but when it's being interpolated into a string to define other parts of the pattern the flags affect each sub-expression:
/\b(?:#{ Regexp.union(words) })\b/i # => /\b(?:(?-mix:one|two|three))\b/i
Dig into the Regexp documentation and you'll see that ?-mix turns off "ignore-case" inside (?-mix:one|two|three), even though the overall pattern is flagged with i, resulting in a pattern that doesn't do what you want, and is really hard to debug:
'foo ONE bar'[/\b(?:#{ Regexp.union(words) })\b/i] # => nil
Instead, source removes the inner expression's flags making the pattern do what you'd expect:
/\b(?:#{ Regexp.union(words).source })\b/i # => /\b(?:one|two|three)\b/i
and
'foo ONE bar'[/\b(?:#{ Regexp.union(words).source })\b/i] # => "ONE"
You can build your patterns using Regexp.new and passing in the flags:
regexp = Regexp.new('(?:one|two|three)', Regexp::EXTENDED | Regexp::IGNORECASE) # => /(?:one|two|three)/ix
but as the expression becomes more complex it becomes unwieldy. Building a pattern using string interpolation remains easier to understand.
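Putting the pieces of this answer together, one possible helper is sketched below. The method name case_insensitive_union is an assumption. It relies on the fact that Regexp.union escapes plain strings, so words containing characters such as . are matched literally.

```ruby
# Sketch: build a case-insensitive, word-bounded union from a list of strings.
# case_insensitive_union is a hypothetical helper name.
def case_insensitive_union(words)
  # .source drops the (?-mix:...) wrapper that would switch ignore-case off.
  Regexp.new("\\b(?:#{Regexp.union(words).source})\\b", Regexp::IGNORECASE)
end

re = case_insensitive_union(%w[one two three])
p 'foo ONE bar'[re]
p 'money'[re] # the word boundaries reject "one" inside another word
```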
You've overlooked the obvious.
strings = %w|one two three|
r = Regexp.union(strings.flat_map do |word|
  len = word.size
  (2**len).times.map { |n|
    len.times.map { |i| n[i]==1 ? word[i].upcase : word[i] } }
end.map(&:join))
"'The Three Little Pigs' should be read by every building contractor" =~ r
#=> 5

Strip numbers from JSON response in Rails 3?

I am receiving a JSON post into my Rails 3 application. I am then parsing each of the values and inserting them into the application database. It's all working well but now I would like to modify the received values before inserting them into the database.
:subject => email_payload['subject']
As the above code shows, I am inserting the received value for 'subject' into the column named 'subject'.
In the example above the received value is like this:
Results from Example Company - Surname/Firstname/[12345]
What I would like to do is strip everything out except the numerical value between the []. So the value that's inserted into the database is simply:
12345
I can, presumably, just select anything from 0-9 but how do I add regex to the received string?
None of the following seem to work:
['subject.gsub!([0-9])']
['subject'.gsub!([0-9])]
['subject'].gsub!([0-9])
I've tested the Regex here http://rubular.com/r/AVFkm3A440
Since you are applying the .gsub() to the value returned by the hash key email_payload['subject'], the method belongs chained outside the closing ].
Your regular expression is missing its / delimiters. To remove each run of non-digits in one step, add a + as in /[^0-9]+/. Inside the brackets, the leading ^ negates the character class, so [^0-9] matches any non-digit character, and .gsub!() replaces those with an empty string. The code below therefore mutates the value stored under email_payload['subject'] in place:
email_payload['subject'] = 'Results from Example Company - Surname/Firstname/[12345]'
email_payload['subject'].gsub!(/[^0-9]+/, '')
>> "12345"
# gsub!() has mutated the value:
puts email_payload['subject']
>> 12345
Examples from Ruby documentation for String:
"hello".gsub(/[aeiou]/, '*') #=> "h*ll*"
"hello".gsub(/([aeiou])/, '<\1>') #=> "h<e>ll<o>"
"hello".gsub(/./) {|s| s.ord.to_s + ' '} #=> "104 101 108 108 111 "
"hello".gsub(/(?<foo>[aeiou])/, '{\k<foo>}') #=> "h{e}ll{o}"
'hello'.gsub(/[eo]/, 'e' => 3, 'o' => '*') #=> "h3ll*"
So gsub substitutes every occurrence of a pattern in the string.
You probably need scan.
"Results from Example Company - Surname/Firstname/[12345]".
scan(/\[(\d*)\]$/).
flatten
>> ["12345"]
This presumes that the digits you want to select are inside [] brackets and that the ] is at the end of the string.
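Under the same assumption (the bracketed number sits at the very end of the subject), String#[] with a regex and a group index returns the digits directly as a string, or nil when nothing matches. The method name extract_id below is made up for this sketch.

```ruby
# Sketch: pull the bracketed number from the end of a subject line.
# extract_id is a hypothetical helper name.
def extract_id(subject)
  # The second argument selects capture group 1; the result is nil if
  # the subject does not end with [digits].
  subject[/\[(\d+)\]\z/, 1]
end

p extract_id('Results from Example Company - Surname/Firstname/[12345]')
p extract_id('No reference number here')
```

Unlike the gsub! approach, this leaves the original string untouched and does not pick up stray digits elsewhere in the subject.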

Parse labeled param strings with Regex

Can anyone help me with this one?
My objective here is to grab some info from a text file, present it to the user, and ask for values to replace that info so as to generate a new output. So I thought of using regular expressions.
My variables would be of the format: {#<num>[|<value>]}.
Here are some examples:
{#1}
{#2|label}
{#3|label|help}
{#4|label|help|something else}
So after some research and experimenting, I came up with this expression: \{\#(\d{1,})(?:\|{1}(.+))*\}
which works pretty well on most occasions, except on something like this:
{#1} some text {#2|label} some more text {#3|label|help}
In this case variables 2 & 3 are matched on a single occurrence rather than on 2 separate matches...
I've already tried to use lookahead commands for the trailing } of the expression, but I didn't manage to get it.
I'm targeting this expression for use in C#, should that further help anyone...
I like the results from this one:
\{\#(\d+)(?:|\|(.+?))\}
This returns 3 groups. The second group is the number (1, 2, 3) and the third group is the arguments ('label', 'label|help').
I prefer to remove the * in favor of | in order to capture all the arguments after the first pipe in the last grouping.
A regular expression which can be used would be something like
\{\#(\d+)(?:\|([^|}]+))*\}
This will prevent reading over any closing }.
Another possible solution (with slightly different behaviour) would be to use a non-greedy matcher (.+?) instead of the greedy version (.+).
Note: I also removed the {1} and replaced {1,} with + which are equivalent in your case.
Try this:
\{\#(\d+)(?:\|[^|}]+)*\}
In C#:
MatchCollection matches = Regex.Matches(mystring,
    @"\{\#(\d+)(?:\|[^|}]+)*\}");
It prevents the label and help from eating the | or }.
matches[0].Value => {#1}
matches[0].Groups[0].Value => {#1}
matches[0].Groups[1].Value => 1
matches[1].Value => {#2|label}
matches[1].Groups[0].Value => {#2|label}
matches[1].Groups[1].Value => 2
matches[2].Value => {#3|label|help}
matches[2].Groups[0].Value => {#3|label|help}
matches[2].Groups[1].Value => 3
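The question targets C#, but to stay with the language used elsewhere on this page, here is a Ruby sketch of the same idea: match one variable at a time and split its arguments afterwards. It wraps the repeated part in an extra capture group so the arguments are kept. The constant PARAM_RE and the method parse_params are names made up for the example.

```ruby
# Sketch: extract {#num|arg|arg...} variables, one match per variable.
# PARAM_RE and parse_params are hypothetical names.
PARAM_RE = /\{\#(\d+)((?:\|[^|}]+)*)\}/

def parse_params(text)
  text.scan(PARAM_RE).map do |num, rest|
    # rest looks like "|label|help" (or "" when there are no arguments).
    { num: Integer(num), args: rest.split('|').reject(&:empty?) }
  end
end

parse_params('{#1} some text {#2|label} some more text {#3|label|help}').each do |v|
  puts "#{v[:num]}: #{v[:args].inspect}"
end
```

Because [^|}]+ cannot cross a closing brace, each variable is matched separately even when several appear on one line.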