Regex capture optional group in any order - regex

I would like to capture groups based on a consecutive occurrence of matched groups in any order. And when one set type is repeated without the alternative set type, the alternative set is returned as nil.
So the following:
"123 dog cat cow 456 678 890 sheep"
Would return the following:
[["123", "dog"], [nil, "cat"], ["456", "cow"], ["678", nil], ["890", "sheep"]]

A regular expression can get us part of the way, but I do not believe all the way.
r = /
(?: # begin non-capture group
\d+ # match 1+ digits
[ ] # match 1 space
[^ \d]+ # match 1+ chars other than digits and spaces
| # or
[^ \d]+ # match 1+ chars other than digits and spaces
[ ] # match 1 space
\d+ # match 1+ digits
| # or
[^ ]+ # match 1+ chars other than spaces
) # end non-capture group
/x # free-spacing regex definition mode
str = "123 dog cat cow 456 678 890 sheep"
str.scan(r).map do |s|
  case s
  when /\d [^ \d]/
    s.split(' ')
  when /[^ \d] \d/
    s.split(' ').reverse
  when /\d/
    [s, nil]
  else
    [nil, s]
  end
end
#=> [["123", "dog"], [nil, "cat"], ["456", "cow"],
# ["678", nil], ["890", "sheep"]]
Note:
str.scan r
#=> ["123 dog", "cat", "cow 456", "678", "890 sheep"]
This regular expression is conventionally written
/(?:\d+ [^ \d]+|[^ \d]+ \d+|[^ ]+)/
Here is another solution that only uses regular expressions incidentally.
def doit(str)
  str.gsub(/[^ ]+/).with_object([]) do |s, a|
    prev = a.empty? ? [0, 'a'] : a.last
    case s
    when /\A\d+\z/ # all digits
      if prev[0].nil?
        a[-1][0] = s
      else
        a << [s, nil]
      end
    when /\A\D+\z/ # all non-digits
      if prev[1].nil?
        a[-1][1] = s
      else
        a << [nil, s]
      end
    else
      raise ArgumentError
    end
  end
end
doit str
#=> [["123", "dog"], [nil, "cat"], ["456", "cow"], ["678", nil],
# ["890", "sheep"]]
This uses the form of String#gsub that takes no block and therefore returns an enumerator:
enum = str.gsub(/[^ ]+/)
#=> #<Enumerator: "123 dog cat cow 456 678 890 sheep":gsub(/[^ ]+/)>
enum.next
#=> "123"
enum.next
#=> "dog"
...
enum.next
#=> "sheep"
enum.next
#=> StopIteration (iteration reached an end)

Related

Remove only non-leading and non-trailing spaces from a string in Ruby?

I'm trying to write a Ruby method that will return true only if the input is a valid phone number, which means, among other rules, it can have spaces and/or dashes between the digits, but not before or after the digits.
In a sense, I need a method that does the opposite of String#strip! (remove all spaces except leading and trailing spaces), plus the same for dashes.
I've tried using String#gsub!, but when I try to match a space or a dash between digits, then it replaces the digits as well as the space/dash.
Here's an example of the code I'm using to remove spaces. I figure once I know how to do that, it will be the same story with the dashes.
def valid_phone_number?(number)
phone_number_pattern = /^0[^0]\d{8}$/
# remove spaces
number.gsub!(/\d\s+\d/, "")
return number.match?(phone_number_pattern)
end
What happens is if I call the method with the following input:
valid_phone_number?(" 09 777 55 888 ")
I get false because line 5 transforms the number into " 0788 ", i.e. it gets rid of the digits around the spaces as well as the spaces. What I want it to do is just to get rid of the inner spaces, so as to produce " 0977755888 ".
I've tried
number.gsub!(/\d(\s+)\d/, "") and number.gsub!(/\d(\s+)\d/) { |match| "" } to no avail.
Thank you!!
If you want to return a boolean, you might for example use a pattern that accepts leading and trailing spaces, and matches 10 digits (as in your example data) where there can be optional spaces or hyphens in between.
^ *\d(?:[ -]*\d){9} *$
For example
def valid_phone_number?(number)
phone_number_pattern = /^ *\d(?:[ -]*\d){9} *$/
return number.match?(phone_number_pattern)
end
See a Ruby demo and a regex demo.
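One caveat with the pattern above: in Ruby, ^ and $ match at the start and end of each line rather than of the whole string, so a multi-line input whose last line looks like a valid number would pass. A minimal sketch of a stricter variant, assuming whole-string validation is what is wanted (PHONE_PATTERN and strict_phone_number? are hypothetical names, not part of the answer):

```ruby
# Hypothetical stricter variant: \A and \z anchor to the whole string,
# whereas ^ and $ would anchor to individual lines.
PHONE_PATTERN = /\A *\d(?:[ -]*\d){9} *\z/

def strict_phone_number?(number)
  number.match?(PHONE_PATTERN)
end

p strict_phone_number?(" 09 777 55 888 ")   # => true
p strict_phone_number?("09-777-55-888")     # => true
p strict_phone_number?("09 777 55 88")      # => false (only 9 digits)
p strict_phone_number?("junk\n0977755888")  # => false (extra first line)
```

The last call is the one that separates the two anchoring styles: with ^ and $ the second line alone would satisfy the pattern.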
To remove spaces and hyphens between digits, try:
(?:\d+|\G(?!^)\d+)\K[- ]+(?=\d)
See an online regex demo
(?: - Open non-capture group;
\d+ - Match 1+ digits;
| - Or;
\G(?!^)\d+ - Assert the position at the end of the previous match (but not at the start of a line), followed by 1+ digits;
)\K - Close non-capture group and reset matching point;
[- ]+ - Match 1+ space/hyphen;
(?=\d) - Assert position is followed by digits.
p " 09 777 55 888 ".gsub(/(?:\d+|\G(?!^)\d+)\K[- ]+(?=\d)/, '')
Prints: " 0977755888 "
Using a very simple regex (/\d/ tests for a digit):
str = " 09 777 55 888 "
r = str.index(/\d/)..str.rindex(/\d/)
str[r] = str[r].delete(" -")
p str # => " 0977755888 "
Passing a block to gsub is an option, capture groups available as globals:
>> str = " 09 777 55 888 "
# simple, easy to understand
>> str.gsub(/(^\s+)([\d\s-]+?)(\s+$)/){ "#$1#{$2.delete('- ')}#$3" }
=> " 0977755888 "
# a different take on @steenslag's answer, to avoid using range.
>> s = str.dup; s[/^\s+([\d\s-]+?)\s+$/, 1] = s.delete("- "); s
=> " 0977755888 "
Benchmark, not that it matters that much:
require 'benchmark'
n = 1_000_000
puts(Benchmark.bmbm do |x|
# just a match
x.report("match") { n.times {str.match(/^ *\d(?:[ -]*\d){9} *$/) } }
# use regex in []=
x.report("[//]=") { n.times {s = str.dup; s[/^\s+([\d\s-]+?)\s+$/, 1] = s.delete("- "); s } }
# use range in []=
x.report("[..]=") { n.times {s = str.dup; r = s.index(/\d/)..s.rindex(/\d/); s[r] = s[r].delete(" -"); s } }
# block in gsub
x.report("block") { n.times {str.gsub(/(^\s+)([\d\s-]+?)(\s+$)/){ "#$1#{$2.delete('- ')}#$3" }} }
# long regex
x.report("regex") { n.times {str.gsub(/(?:\d+|\G(?!^)\d+)\K[- ]+(?=\d)/, "")} }
end)
Rehearsal -----------------------------------------
match 0.997458 0.000004 0.997462 ( 0.998003)
[//]= 1.822698 0.003983 1.826681 ( 1.827574)
[..]= 3.095630 0.007955 3.103585 ( 3.105489)
block 3.515401 0.003982 3.519383 ( 3.521392)
regex 4.761748 0.007967 4.769715 ( 4.772972)
------------------------------- total: 14.216826sec
user system total real
match 1.031670 0.000000 1.031670 ( 1.032347)
[//]= 1.859028 0.000000 1.859028 ( 1.860013)
[..]= 3.074159 0.003978 3.078137 ( 3.079825)
block 3.751532 0.011982 3.763514 ( 3.765673)
regex 4.634857 0.003972 4.638829 ( 4.641259)

How to write a regex for a date-time string

dateTime = "SATURDAY1200PM1230PMWEEKLY"
Desired Result: "12:00 PM - 12:30 PM"
I tried doing this: let str = "SATURDAY600PM630PMWEEKLY".split(/[^A-Z][0-9]{3,4}(A|P)M/);
But I keep getting an array with chars/numbers. I am unsure if split is the way to go here.
Try a match approach:
var dateTime = "SATURDAY1200PM1230PMWEEKLY";
var ts = dateTime.match(/\d{3,4}[AP]M/g)
.map(x => x.replace(/(\d{1,2})(\d{2})([AP]M)/, "$1:$2 $3"))
.join(" - ");
console.log(ts);
As the programming language was not given I will provide a straightforward solution in Ruby which I expect could be converted easily to most other languages.
str = "SATURDAY1130AM130PMWEEKLY"
rgx = /\A[A-Z]+(\d{1,2})(\d{2})([AP]M)(\d{1,2})(\d{2})([AP]M)[A-Z]+\z/
m = str.match(rgx)
#=> #<MatchData "SATURDAY1130AM130PMWEEKLY" 1:"11" 2:"30" 3:"AM" 4:"1" 5:"30" 6:"PM">
"%s:%s %s - %s:%s %s" % [$1, $2, $3, $4, $5, $6]
#=> "11:30 AM - 1:30 PM"
The regular expression could be broken down as follows.
\A # match beginning of string
[A-Z]+ # match one or more uppercase letters
(\d{1,2}) # match 1 or 2 digits, save to capture group 1
(\d{2}) # match 2 digits, save to capture group 2
([AP]M) # match 'AM' or 'PM', save to capture group 3
(\d{1,2}) # match 1 or 2 digits, save to capture group 4
(\d{2}) # match 2 digits, save to capture group 5
([AP]M) # match 'AM' or 'PM', save to capture group 6
[A-Z]+ # match one or more uppercase letters
\z # match end of string
The last statement could also be written:
"%s:%s %s - %s:%s %s" % m.captures
#=> "11:30 AM - 1:30 PM"
which of course is specific to Ruby.
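For readers who find six positional groups hard to track, the same pattern can be sketched with named captures in free-spacing mode. TIME_RANGE and format_time_range are illustrative names, and returning nil for input that does not match is an assumption rather than something the answer specifies.

```ruby
# Same structure as the regex above, with named groups instead of $1..$6.
TIME_RANGE = /
  \A[A-Z]+                                 # leading uppercase text, e.g. SATURDAY
  (?<h1>\d{1,2})(?<m1>\d{2})(?<p1>[AP]M)   # start time
  (?<h2>\d{1,2})(?<m2>\d{2})(?<p2>[AP]M)   # end time
  [A-Z]+\z                                 # trailing uppercase text, e.g. WEEKLY
/x

# Returns a string like "11:30 AM - 1:30 PM", or nil when the input does not match.
def format_time_range(str)
  m = TIME_RANGE.match(str)
  return nil unless m

  "#{m[:h1]}:#{m[:m1]} #{m[:p1]} - #{m[:h2]}:#{m[:m2]} #{m[:p2]}"
end

puts format_time_range("SATURDAY1130AM130PMWEEKLY")   # 11:30 AM - 1:30 PM
puts format_time_range("SATURDAY1200PM1230PMWEEKLY")  # 12:00 PM - 12:30 PM
```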
Another way is to make use of a language's date-time library. Again, this could be done as follows in Ruby.
require 'time'
s1, s2 = str.scan(/\d{3,4}[AP]M/).map do |s|
s.sub(/(?=\d{2}[AP])/, ' ')
end
#=> ["11 30AM", "1 30PM"]
t1 = DateTime.strptime(s1, '%I %M%p')
#=> #<DateTime: 2022-02-01T11:30:00+00:00
# ((2459612j,41400s,0n),+0s,2299161j)>
t2 = DateTime.strptime(s2, '%I %M%p')
#=> #<DateTime: 2022-02-01T13:30:00+00:00
# ((2459612j,48600s,0n),+0s,2299161j)>
t1.strftime('%-l:%M %p') + " - " + t2.strftime('%-l:%M %p')
#=> "11:30 AM - 1:30 PM"
If you are wondering why the .map do |s| s.sub(/(?=\d{2}[AP])/, ' ') end step is needed when computing s1 and s2, try removing it and changing the format string to '%I%M%p'.
One solution is to use match and then convert the result into your string:
let str = "SATURDAY600PM630PMWEEKLY"
.match(/[\d]{3,4}(A|P)M/g)
.map((time) => {
const AMPM = time.slice(-2);
const m = time.slice(-4,-2);
const h = time.slice(0,-4);
return `${h}:${m} ${AMPM}`;
})
.join(' - ')
console.log(str)

Get regex to match multiple instances of the same pattern

So I have this regex:
\[shortcode ([^ ]*)(?:[ ]?([^ ]*)="([^"]*)")*\]
Trying to match on this string
[shortcode contact param1="test 2" param2="test1"]
Right now, the regex matches this:
[contact, param2, test1]
I would like it to match this:
[contact, param1, test 2, param2, test1]
How can I get regex to match the first instance of the parameters pattern, rather than just the last?
You may use
'~(?:\G(?!^)\s+|\[shortcode\s+(\S+)\s+)([^\s=]+)="([^"]*)"~'
See the regex demo
Details
(?:\G(?!^)\s+|\[shortcode\s+(\S+)\s+) - either the end of the previous match and 1+ whitespaces right after (\G(?!^)\s+) or (|)
\[shortcode - literal string
\s+ - 1+ whitespaces
(\S+) - Group 1: one or more non-whitespace chars
\s+ - 1+ whitespaces
([^\s=]+) - Group 2: 1+ chars other than whitespace and =
=" - a literal substring
([^"]*) - Group 3: any 0+ chars other than "
" - a " char.
PHP demo
$re = '~(?:\G(?!^)\s+|\[shortcode\s+(\S+)\s+)([^\s=]+)="([^"]*)"~';
$str = '[shortcode contact param1="test 2" param2="test1"]';
$res = [];
if (preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0)) {
    foreach ($matches as $m) {
        array_shift($m);
        $res = array_merge($res, array_filter($m));
    }
}
print_r($res);
// => Array( [0] => contact [1] => param1 [2] => test 2 [3] => param2 [4] => test1 )
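If \G feels too opaque, a two-step approach is another option: match the whole shortcode once, then scan its attribute part for every name="value" pair. This is a different technique from the single-regex answer above, sketched here in Ruby. parse_shortcode is a hypothetical helper, and the sketch assumes attribute values never contain a double quote.

```ruby
# Step 1: match the shortcode as a whole, capturing its name and the raw attribute text.
# Step 2: scan the attribute text for each name="value" pair.
def parse_shortcode(text)
  m = text.match(/\[shortcode\s+(\S+)((?:\s+[^\s=]+="[^"]*")*)\s*\]/)
  return nil unless m

  name = m[1]
  pairs = m[2].scan(/([^\s=]+)="([^"]*)"/) # array of [name, value] pairs
  [name] + pairs.flatten
end

p parse_shortcode('[shortcode contact param1="test 2" param2="test1"]')
# => ["contact", "param1", "test 2", "param2", "test1"]
```

A repeated capture group only keeps its last iteration, which is why the original regex returned just param2; scanning in a second step sidesteps that limitation.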
Try the regex below, applied to your use case:
var testString = '[shortcode contact param1="test 2" param2="test1"]';
var regex = /[\w\s]+(?=[\="]|\")/gm;
var found = testString.match(regex);
If you log found you will see the result as
["shortcode contact param1", "test 2", " param2", "test1"]
The regex matches runs of alphanumeric characters, underscores and whitespace, but only when they are followed by = or ".
I hope this helps.

Remove all instances of sub-string after a different sub-string has occurred N times

I've been attempting to replace the character '-' with 'Z', but only if it is preceded by 2 or more 'Z's in the string.
input = c("XX-XXZZXX-XZXXXXX", "XX-XXXZXXZXZXXX", "XXXXXZXXXZXXZX-X",
"XXXZXXXZXZXZXXX", "XZXXX-XXXZXZXXX", "XX-XXX-ZZX", "XXZX-XXZXXX-XZ",
"XZXZXX-XXZXXZXX")
desired_output = c("XX-XXZZXXZXZXXXXX", "XX-XXXZXXZXZXXX", "XXXXXZXXXZXXZXZX",
"XXXZXXXZXZXZXXX", "XZXXX-XXXZXZXXX", "XX-XXX-ZZX", "XXZX-XXZXXXZXZ",
"XZXZXXZXXZXXZXX")
I've had some success in removing everything before or after the second occurrence, but can't quite bridge the gap to replacing the needed character while keeping everything else. There's no guarantee that either a Z or a - will be in the string.
This is not an easy regex, but you still can use it to achieve what you need.
input = c("XX-XXZZXX-XZXXXXX", "XX-XXXZXXZXZXXX", "XXXXXZXXXZXXZX-X",
"XXXZXXXZXZXZXXX", "XZXXX-XXXZXZXXX", "XX-XXX-ZZX", "XXZX-XXZXXX-XZ",
"XZXZXX-XXZXXZXX")
gsub("(?:^([^Z]*Z){2}|(?!^)\\G)[^-]*\\K-", "Z", input, perl=T)
See IDEONE demo
The regex first matches two chunks ending with Z (to make sure there are two Zs counting from the beginning of the string), then any characters other than a hyphen, and then a hyphen. Only the hyphen is replaced by gsub because the \K operator discards everything matched before it. All subsequent hyphens are matched thanks to the \G operator, which matches at the position where the previous successful match ended.
Explanation:
(?:^([^Z]*Z){2}|(?!^)\\G) - match 2 alternatives:
^([^Z]*Z){2} - start of string (^) followed by 2 occurrences ({2}) of substrings that contain 0 or more characters other than Z ([^Z]*) followed by Z or...
(?!^)\\G - end of the previous successful match
[^-]*\\K - match 0 or more characters other than - and omit the whole text matched so far with \K
- - a literal hyphen that will be replaced with Z.
The perl=T argument is required here: \G and \K are PCRE features that R's default regex engine does not support.
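The same pattern should carry over to other engines that support \G and \K. As a sketch, here it is in Ruby, with \A in place of ^ because Ruby's ^ matches at the start of every line. HYPHEN_AFTER_TWO_ZS and replace_hyphens are hypothetical names.

```ruby
# Once two Zs have been seen from the start of the string, match each later hyphen.
#   \A(?:[^Z]*Z){2}  two chunks ending in Z, anchored at the string start
#   (?!\A)\G         or: continue from where the previous match ended
#   [^-]*\K-         skip non-hyphens, drop them from the match, match one hyphen
HYPHEN_AFTER_TWO_ZS = /(?:\A(?:[^Z]*Z){2}|(?!\A)\G)[^-]*\K-/

def replace_hyphens(str)
  str.gsub(HYPHEN_AFTER_TWO_ZS, 'Z')
end

input = %w[XX-XXZZXX-XZXXXXX XX-XXXZXXZXZXXX XXXXXZXXXZXXZX-X XXXZXXXZXZXZXXX
           XZXXX-XXXZXZXXX XX-XXX-ZZX XXZX-XXZXXX-XZ XZXZXX-XXZXXZXX]

input.each { |s| puts "#{s} -> #{replace_hyphens(s)}" }
```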
Way out of my league in regex here as demonstrated by @stribizhev's answer, but you can do this without regular expressions by simply splitting the entire string, counting up the occurrences of Z, and subbing out subsequent -:
input = c("XX-XXZZXX-XZXXXXX", "XX-XXXZXXZXZXXX", "XXXXXZXXXZXXZX-X",
"XXXZXXXZXZXZXXX", "XZXXX-XXXZXZXXX", "XX-XXX-ZZX", "XXZX-XXZXXX-XZ",
"XZXZXX-XXZXXZXX")
desired_output = c("XX-XXZZXXZXZXXXXX", "XX-XXXZXXZXZXXX", "XXXXXZXXXZXXZXZX",
"XXXZXXXZXZXZXXX", "XZXXX-XXXZXZXXX", "XX-XXX-ZZX", "XXZX-XXZXXXZXZ",
"XZXZXXZXXZXXZXX")
sp <- strsplit(input, '')
f <- function(x, n = 2) {
x[x == '-' & (cumsum(x == 'Z') >= n)] <- 'Z'
paste0(x, collapse = '')
}
identical(res <- sapply(sp, f), desired_output)
# [1] TRUE
cbind(input, res, desired_output)
# input res desired_output
# [1,] "XX-XXZZXX-XZXXXXX" "XX-XXZZXXZXZXXXXX" "XX-XXZZXXZXZXXXXX"
# [2,] "XX-XXXZXXZXZXXX" "XX-XXXZXXZXZXXX" "XX-XXXZXXZXZXXX"
# [3,] "XXXXXZXXXZXXZX-X" "XXXXXZXXXZXXZXZX" "XXXXXZXXXZXXZXZX"
# [4,] "XXXZXXXZXZXZXXX" "XXXZXXXZXZXZXXX" "XXXZXXXZXZXZXXX"
# [5,] "XZXXX-XXXZXZXXX" "XZXXX-XXXZXZXXX" "XZXXX-XXXZXZXXX"
# [6,] "XX-XXX-ZZX" "XX-XXX-ZZX" "XX-XXX-ZZX"
# [7,] "XXZX-XXZXXX-XZ" "XXZX-XXZXXXZXZ" "XXZX-XXZXXXZXZ"
# [8,] "XZXZXX-XXZXXZXX" "XZXZXXZXXZXXZXX" "XZXZXXZXXZXXZXX"

Regex that matches specific spaces

I've been trying to do this Regex for a while now. I'd like to create one that matches all the spaces of a text, except those in literal string.
Example:
123 Foo "String with spaces"
Space between 123 and Foo would match, as well as the one between Foo and "String with spaces", but only those two.
Thanks
A common, simple strategy for this is to count the number of quotes leading up to your location in the string. If the count is odd, you are inside a quoted string; if the amount is even, you are outside a quoted string. I can't think of a way to do this in regular expressions, but you could use this strategy to filter the results.
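The quote-counting strategy can be sketched without any regular expression. The Ruby below toggles a flag on every double quote instead of recounting, which amounts to checking whether the count so far is odd. spaces_outside_quotes is a hypothetical name, and the sketch assumes quotes are balanced and never escaped.

```ruby
# Return the indices of the spaces that lie outside double-quoted strings.
# Assumption: quotes are balanced and there are no escaped quotes.
def spaces_outside_quotes(text)
  inside = false # true while an odd number of quotes has been seen
  indices = []
  text.each_char.with_index do |ch, i|
    if ch == '"'
      inside = !inside
    elsif ch == ' ' && !inside
      indices << i
    end
  end
  indices
end

p spaces_outside_quotes('123 Foo "String with spaces"') # => [3, 7]
```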
You could use re.findall to match either a string or a space and then afterwards inspect the matches:
import re
hits = re.findall("\"(?:\\\\.|[^\\\"])*\"|[ ]", 'foo bar baz "another\\" test\" and done')
for h in hits:
    print("found: [%s]" % h)
yields:
found: [ ]
found: [ ]
found: [ ]
found: ["another\" test"]
found: [ ]
found: [ ]
A short explanation:
" # match a double quote
(?: # start non-capture group 1
\\\\. # match a backslash followed by any character (except line breaks)
| # OR
[^\\\"] # match any character except a '\' and '"'
)* # end non-capture group 1 and repeat it zero or more times
" # match a double quote
| # OR
[ ] # match a single space
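The same alternation can also report where the unquoted spaces are, not just that they exist. A sketch in Ruby, assuming character positions are what is wanted; TOKEN and space_offsets are hypothetical names.

```ruby
# A quoted string (with backslash escapes) OR a single space, as explained above.
TOKEN = /"(?:\\.|[^\\"])*"|[ ]/

# Walk the text match by match; quoted strings are consumed whole,
# so only spaces outside of them are ever matched on their own.
def space_offsets(text)
  offsets = []
  pos = 0
  while (m = TOKEN.match(text, pos))
    offsets << m.begin(0) if m[0] == ' '
    pos = m.end(0) # every match is at least one character, so this always advances
  end
  offsets
end

p space_offsets('123 Foo "String with spaces"') # => [3, 7]
```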
If this ->123 Foo "String with spaces"<- is the structure of every line, that is to say unquoted text followed by a quoted text, you could capture two groups, the unquoted and the quoted text, and tackle them separately.
Example regex -> (.*)(".*") where $1 should contain ->123 Foo <- and $2 ->"String with spaces"<-
Java example:
String aux = "123 Foo \"String with spaces\"";
String regex = "(.*)(\".*\")";
String unquoted = aux.replaceAll(regex, "$1").replace(" ", "");
String quoted = aux.replaceAll(regex, "$2");
System.out.println(unquoted+quoted);
JavaScript example:
<SCRIPT LANGUAGE="JavaScript">
<!--
str='1 23 Foo \"String with spaces\"';
re = new RegExp('(.*)(".*")') ;
var unquoted = str.replace(re, "$1");
var quoted = str.replace(re, "$2");
document.write (unquoted.split(' ').join('')+quoted);
// -->
</SCRIPT>