How to match different groups in regex - regex

I have the following string:
"Josua de Grave* (1643-1712)"
Everything before the * is the person's name, the first date 1634 is his birth date, 1712 is the date of his death.
Following this logic I'd like to have 3 match groups for each one of the item. I tried
([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})
"Josua de Grave* (1643-1712)".match(/([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})/)
but that returns nil.
Why is my logic wrong, and what should I do to get the 3 intended match groups.

The additional brackets ( ) around the digit 1643-1712 values needs to be added in your regex pattern so use
([a-zA-Z\s]*)\* \((\d{3,4})-(\d{3,4})\)
// ^^ ^^
since brackets represents the captured group so escape them using \ to match them as a character.

While you can use a pattern, the problem of splitting this into its parts can also be easily done using other Ruby methods:
Using split:
s = "Josua de Grave* (1643-1712)"
name, dates = s.split('*') # => ["Josua de Grave", " (1643-1712)"]
birth, death = dates[2..-2].split('-') # => ["1643", "1712"]
Or, using scan:
*name, birth, death = s.scan(/[[:alnum:]]+/) # => ["Josua", "de", "Grave", "1643", "1712"]
name.join(' ') # => "Josua de Grave"
birth # => "1643"
death # => "1712"
If I was using a pattern, I'd use this:
name, birth, death = /^([^*]+).+?(\d+)-(\d+)/.match(s)[1..3] # => ["Josua de Grave", "1643", "1712"]
name # => "Josua de Grave"
birth # => "1643"
death # => "1712"
/(^[^*]+).+?(\d+)-(\d+)/ means:
^ start at the beginning of the buffer
([^*]+) capture everything not *, where it'll stop capturing
.+? skip the minimum until...
(\d+) the year is matched and captured
- match but don't capture
(\d+) the year is matched and captured
Regexper helps explain it as does Rubular.

r = /\*\s+\(|(?<=\d)\s*-\s*|\)/
"Josua de Grave* (1643-1712)".split r
#=> ["Josua de Grave", "1643", "1712"]
"Sir Winston Leonard Spencer-Churchill* (1874 - 1965)".split r
#=> ["Sir Winston Leonard Spencer-Churchill", "1874", "1965"]
The regular expression can be made self-documenting by writing it in free-spacing mode:
r = /
\*\s+\( # match '*' then >= 1 whitespaces then '('
| # or
(?<=\d) # match is preceded by a digit (positive lookbehind)
\s*-\s* # match >= 0 whitespaces then '-' then >= 0 whitespaces
| # or
\) # match ')'
/x # free-spacing regex definition mode
The positive lookbehind is needed to avoid splitting hyphenated names on hyphens. (The positive lookahead (?=\d), placed after \s*-\s*, could be used instead.)

Related

how to get numbers from array of strings?

I have this array of strings.
["Anyvalue", "Total", "value:", "9,999.00", "Token", " ", "|", " ", "Total", "chain", "value:", "4,948"]
and I'm trying to get numbers in one line of code. I tried many methods but wasn't really helpful as am expecting.
I'm using one with grep method:
array.grep(/\d+/, &:to_i) #[9, 4]
but it returns an array of first integers only. It seems like I have to add something to the pattern but I don't know what.
Or there is another way to grab these numbers in an Array?
you can use:
array.grep(/[\d,]+\.?\d+/)
if you want int:
array.grep(/[\d,]+\.?\d+/).map {_1.gsub(/[^0-9\.]/, '').to_i}
and a faster way (about 5X to 10X):
array.grep(/[\d,]+\.?\d+/).map { _1.delete("^0-9.").to_i }
for a data like:
%w[
,,,4
1
1.2.3.4
-2
1,2,3
9,999.00
4,948
22,956
22,536,129,336
123,456
12.]
use:
data.grep(/^-?\d{1,3}(,\d{3})*(\.?\d+)?$/)
output:
["1", "-2", "9,999.00", "4,948", "22,956", "22,536,129,336", "123,456"]
arr = ["Anyvalue", "Total", "value:", "9,999.00", "Token", " ", "61.4.5",
"|", "chain", "-4,948", "3,25.61", "1,234,567.899"]
rgx = /\A\-?\d{1,3}(?:,\d{3})*(?:\.\d+)?\z/
arr.grep(rgx)
#=> ["9,999.00", "-4,948", "1,234,567.899"]
Regex demo. At the link the regular expression was evaluated with the PCRE regex engine but the results are the same when Ruby's Onigmo engine is used. Also, at the link I've used the anchors ^ and $ (beginning and end of line) instead of \A and \z (beginning and end of string) in order test the regex against multiple strings.
The regular expression can be broken down as follows.
/
\A # match the beginning of the string
\-? # optionally match '-'
\d{1,3} # match between 1 and 3 digits inclusively
(?: # begin a non-capture group
,\d{3} # match a comma followed by 3 digits
)* # end the non-capture group and execute 0 or more times
(?: # begin a non-capture group
\.\d+ # match a period followed by one or more digits
)? # end the non-capture and make it optional
\z # match the end of the string
/
To make the test more robust we could use the methods Kernel::Float, Kernel::Rational and Kernel::Complex, all with the optional argument :exception set to false.
arr = ["Total", "9,999.00", " ", "61.4.5", "23e4", "-234.7e-2", "1+2i",
"3/4", "|", "chain", "-4,948", "3,25.61", "1,234,567.899", "10"]
arr.select { |s| s.match?(rxg) || Float(s, exception: false) ||
Rational(s, exception: false) Complex(s, exception: false) }
#=> ["9,999.00", "23e4", "-234.7e-2", "1+2i", "3/4", "-4,948",
# "1,234,567.899", "10"]
Note that "23e4", "-234.7e-2", "1+2i" and "3/4" are respectively the string representations of an integer, float, complex and rational number.

Regex scan fails

I am trying to parse all money from a string. For example, I want to extract:
['$250,000', '$3.90', '$250,000', '$500,000']
from:
'Up to $250,000………………………………… $3.90 Over $250,000 to $500,000'
The regex:
\$\ ?(\d+\,)*\d+(\.\d*)?
seems to match all money expressions as in this link. However, when I try to scan on Ruby, it fails to give me the desired result.
s # => "Up to $250,000 $3.90 Over $250,000 to $500,000, add$3.70 Over $500,000 to $1,000,000, add..$3.40 Over $1,000,000 to $2,000,000, add...........$2.25\nOver $2,000,000 add ..$2.00"
r # => /\$\ ?(\d+\,)*\d+\.?\d*/
s.scan(r)
# => [["250,"], [nil], ["250,"], ["500,"], [nil], ["500,"], ["000,"], [nil], ["000,"], ["000,"], [nil], ["000,"], [nil]]
From String#scan docs, it looks like this is because of the group. How can I parse all the money in the string?
Let's look at your regular expression, which I'll write in free-spacing mode so I can document it:
r = /
\$ # match a dollar sign
\ ? # optionally match a space (has no effect)
( # begin capture group 1
\d+ # match one or more digits
, # match a comma (need not be escaped)
)* # end capture group 1 and execute it >= 0 times
\d+ # match one or more digits
\.? # optionally match a period
\d* # match zero or more digits
/x # free-spacing regex definition mode
In non-free-spacing mode this would be written as follows.
r = /\$ ?(\d+,)*\d+\.?\d*/
When a regex is defined in free-spacing mode all spaces are stripped out before the regex is evaluated, which is why I had to escape the space. That's not necessary when the regex is not defined in free-spacing mode.
It is nowhere needed to match a space after the dollars sign, so \ ? should be removed. Suppose now we have
r = /\$\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> ["$2.31", "$44.", "$33.607"]
That works, but it is questionable whether you want to match values that do not have exactly two digits after the decimal point.
Now write
r = /\$(\d+,)*\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> [[nil], [nil], [nil]]
To see why this result was obtained examine the doc for String#scan, specifically the last sentence of the first paragraph: " If the pattern contains groups, each individual result is itself an array containing one entry per group.".
We can avoid that problem by changing the capture group to a non-capture group:
r = /\$(?:\d+,)*\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> ["$2.31", "$44.", "$33.607"]
Now consider this:
"$2,241.31 cat $1,2345. dog $33.607".scan r
#=> ["$2,241.31", "$1,2345.", "$33.607"]
which is still not quite right. Try the following.
r = /
\$ # match a dollar sign
\d{1,3} # match one to three digits
(?:,\d{3}) # match ',' then 3 digits in a nc group
* # execute the above nc group >=0 times
(?:\.\d{2}) # match '.' then 2 digits in a nc group
? # optionally match the above nc group
(?![\d,.]) # no following digit, ',' or '.'
/x # free-spacing regex definition mode
"$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27".scan r
#=> ["$2,241.31", "$2", "$1,234", "$146.27"]
(?![\d,.]) is a negative lookahead.
In normal mode this regular expression is written as follows.
r = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?![\d,.])/
The following erroneous result would obtain without the negative lookahead at the end of the regex.
r = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?/
"$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27".scan r
#=> ["$2,241.31", "$2", "$1,234", "$3,615", "$33.60",
# "$146.27"]
[3] pry(main)> str = <<EOF
[3] pry(main)* Up to $250,000………………………………… $3.90 Over $250,000 to $500,000, add………………$3.70 Over $500,000 to $1,000,000, add……………..$3.40 Over $1,000,000 to $2,000,000, add……...........$2.25
[3] pry(main)* Over $2,000,000 add …..………………………$2.00
[3] pry(main)* EOF
=> "Up to $250,000………………………………… $3.90 Over $250,000 to $500,000, add………………$3.70 Over $500,000 to $1,000,000, add……………..$3.40 Over $1,000,000 to $2,000,000, add……...........$2.25\nOver $2,000,000 add …..………………………$2.00\n"
[4] pry(main)> str.scan /\$\d+(?:[,.]\d+)*/
=> ["$250,000", "$3.90", "$250,000", "$500,000", "$3.70", "$500,000", "$1,000,000", "$3.40", "$1,000,000", "$2,000,000", "$2.25", "$2,000,000", "$2.00"]
[5] pry(main)>

Regular Expression To Extract Names

I have strings in this form:
"""00.000000 00.000000; X-XX000-0000-0; France; Paris; Street 12a;
00.000000 00.000000; X-XX000-0000-0; Spain; Barcelona; Street 123;"""
I want to get specific data towns above string. How do I get this data??
If you just want to get the city for your given example, you could use a positive lookahead:
\b[^;]+(?=;[^;]+;$)
Explanation
\b # Word boundary
[^;]+ # Match NOT ; one or more times
(?= # Positive lookahead that asserts what follows is
; # Match semicolon
[^;]+ # Match NOT ; one or more times
; # Match ;
$ # Match end of the string
) # Close lookahead
Assuming Python (three quotes-string):
string = """00.000000 00.000000; X-XX000-0000-0; France; Paris; Street 12a;
00.000000 00.000000; X-XX000-0000-0; Spain; Barcelona; Street 123;"""
towns = [part[3] for line in string.split("\n") for part in [line.split("; ")]]
print(towns)
Which yields
['Paris', 'Barcelona']
No regex needed, really.
If you have the city on the 4th field, you can match it using this pattern:
/(?:[^;]*;){3}([^;]*);/
See the demo
[^;]*; you find a field consisting in non-semicolons and ending with a semicolon
(?:...){3} you find it 3 times, but you do not capture it
([^;]*); then you get 4th column matching its content (not the semicolon)

Regex - Extracting a number when preceeded OR followed by a currency sign

if (preg_match_all('((([£€$¥](([ 0-9]([0-9])*)((\.|\,)(\d{2}|\d{1}))|([ 0-9]([0-9])*)))|(([0-9]([0-9])*)((\.|\,)(\d{2}|\d{1})(\s{0}|\s{1}))|([0-9]([0-9])*(\s{0}|\s{1})))[£€$¥]))', $Commande, $matches)) {
$tot1 = $matches[0];
This is my tested solution.
It works for all 4 currencies when sign is placed before or after, with or without a space in between.
It works with a dot or a comma for decimals.
It works without decimal, or with just 1 number after the dot or comma.
It extracts several amounts in the same string in a mix of formats declined above as long as there is a space in between.
I think it covers everything, although I am sure it can be simplified.
It was Needed for an international order form where clients enter the amounts themselves as well as the description in the same field.
You can use a conditional:
if (preg_match_all('~(\$ ?)?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?(?:[pcm]|bn|[mb]illion)?(?(1)| ?\$)~i', $order, $matches)) {
$tot = $matches[0];
}
Explanation:
I put the currency in the first capturing group: (\$ ?) and I make it optional with a ?
At the end of the pattern, I use an if then else:
(?(1) # if the first capturing group exist
# then match nothing
| # else
[ ]?\$ # matches the currency
) # end of the conditional
You should check for optional $ at the end of amount:
\$? ?(\d[\d ,]*(?:\.\d{1,2})?|\d[\d,](?:\.\d{2})?) ?\$?(?:[pcm]|bn|[mb]illion)
Live demo

How to Capture Only Surnames from a Regex Pattern?

Team
I have written a Perl program to validate the accuracy of formatting (punctuation and the like) of surnames, forenames, and years.
If a particular entry doesn't follow a specified pattern, that entry is highlighted to be fixed.
For example, my input file has lines of similar text:
<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>
My programs works just fine, that is, if any entry doesn't follow the pattern, the script generates an error. The above input text doesn't generate any error. But the one below is an example of an error because Rose A. J. is missing a comma after Rose:
NOT FOUND: <bibliomixed id="bkrmbib120">Asher, S. R., & Rose A. J. (1997). Promoting children’s social-emotional adjustment with peers. In P. Salovey & D. Sluyter, (Eds). <emphasis>Emotional development and emotional intelligence: Educational implications.</emphasis> New York: Basic Books.</bibliomixed>
From my regex search pattern, is it possible to capture all the surnames and the year, so I can generate a text prefixed to each line as shown below?
<BIB>Abdo, Afif-Abdo, Otani, Machado, 2008</BIB><bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>
My regex search script is as follows:
while(<$INPUT_REF_XML_FH>){
$line_count += 1;
chomp;
if(/
# bibliomixed XML ID tag and attribute----<START>
<bibliomixed
\s+
id=".*?">
# bibliomixed XML ID tag and attribute----<END>
# --------2 OR MORE AUTHOR GROUP--------<START>
(?:
(?:
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
,\s)+
#---------------FINAL AUTHOR GROUP SEPATOR----<START>
&\s
#---------------FINAL AUTHOR GROUP SEPATOR----<END>
# --------2 OR MORE AUTHOR GROUP--------<END>
)?
# --------LAST AUTHOR GROUP--------<START>
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
(?: # pattern for editor notation----<START>
\s\(Ed(?:s)?\.\)\.
)? # pattern for editor notation----<END>
# --------LAST AUTHOR GROUP--------<END>
\s
\(
# pattern for a year----<START>
(?:[A-Za-z]+,\s)? # July, 1999
(?:[A-Za-z]+\s)? # July 1999
(?:[0-9]{4}\/)? # 1999\/2000
(?:\w+\s\d+,\s)?# August 18, 2003
(?:[0-9]{4}|in\spress|manuscript\sin\spreparation) # (1999) (in press) (manuscript in preparation)----REQUIRED
(?:[A-Za-z])? # 1999a
(?:,\s[A-Za-z]+\s[0-9]+)? # 1999, July 2
(?:,\s[A-Za-z]+\s[0-9]+\x{2013}[0-9]+)? # 2002, June 19–25
(?:,\s[A-Za-z]+)? # 1999, Spring
(?:,\s[A-Za-z]+\/[A-Za-z]+)? # 1999, Spring\/Winter
(?:,\s[A-Za-z]+-[A-Za-z]+)? # 2003, Mid-Winter
(?:,\s[A-Za-z]+\s[A-Za-z]+)? # 2007, Anniversary Issue
# pattern for a year----<END>
\)\.
/six){
print $FOUND_REPORT_FH "$line_count\tFOUND: $&\n";
$found_count += 1;
} else{
print $ERROR_REPORT_FH "$line_count\tNOT FOUND: $_\n";
$not_found_count += 1;
}
Thanks for your help,
Prem
Alter this bit
# pattern for surname----<END>
,?\s
This now means an optional , followed by white space. If the Persons surname is "Bunga Bunga" it won't work
All of your subpatterns are non-capturing groups, starting with (?:. This reduces compilation times by a number of factors, one of which being that the subpattern is not captured.
To capture a pattern you merely need to place parenthesis around the part you require to capture. So you could remove the non-capturing assertion ?: or place parens () where you need them. http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
I'm not sure but, from your code I think you may be attempting to use lookahead assertions as, for example, you test for surnames with spaces, if none then test for surnames with hyphens. This will not start from the same point every time, it will either match the first example or not, then move forward to test the next position with the second surname pattern, whether the regex will then test the second name for the first subpattern is what I am unsure of. http://perldoc.perl.org/perlretut.html#Looking-ahead-and-looking-behind
#!usr/bin/perl
use warnings;
use strict;
my $line = '123 456 7antelope89';
$line =~ /^(\d+\s\d+\s)?(\d+\w+\d+)?/;
my ($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7bealzelope89';
$line =~ /(?:\d+\s\d+\s)?(?:\d+\w+\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7canteloupe89';
$line =~ /((?:\d+\s\d+\s))?(?:\d+(\w+)\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
exit 0;
For capturing the whole pattern the first pattern of the third example does not make sense, as this tells the regex to not capture the pattern group while also capturing the pattern group. Where this is useful is in the second pattern which is a fine grained pattern capture, in that the pattern captured is part of a non-capturing group.
a: 123 456 b: 7antelope89
a: nocapture b: nocapture
a: 123 456 b: canteloupe
One little nitpic
id=".*?"
may be better as
id="\w*?"
id names requiring to be _alphanumeric iirc.