Regex scan fails - regex

I am trying to parse all money from a string. For example, I want to extract:
['$250,000', '$3.90', '$250,000', '$500,000']
from:
'Up to $250,000………………………………… $3.90 Over $250,000 to $500,000'
The regex:
\$\ ?(\d+\,)*\d+(\.\d*)?
seems to match all money expressions as in this link. However, when I try to scan in Ruby, it does not give me the desired result.
s # => "Up to $250,000 $3.90 Over $250,000 to $500,000, add$3.70 Over $500,000 to $1,000,000, add..$3.40 Over $1,000,000 to $2,000,000, add...........$2.25\nOver $2,000,000 add ..$2.00"
r # => /\$\ ?(\d+\,)*\d+\.?\d*/
s.scan(r)
# => [["250,"], [nil], ["250,"], ["500,"], [nil], ["500,"], ["000,"], [nil], ["000,"], ["000,"], [nil], ["000,"], [nil]]
From String#scan docs, it looks like this is because of the group. How can I parse all the money in the string?

Let's look at your regular expression, which I'll write in free-spacing mode so I can document it:
r = /
\$ # match a dollar sign
\ ? # optionally match a space (has no effect)
( # begin capture group 1
\d+ # match one or more digits
, # match a comma (need not be escaped)
)* # end capture group 1 and execute it >= 0 times
\d+ # match one or more digits
\.? # optionally match a period
\d* # match zero or more digits
/x # free-spacing regex definition mode
In non-free-spacing mode this would be written as follows.
r = /\$ ?(\d+,)*\d+\.?\d*/
When a regex is defined in free-spacing mode all spaces are stripped out before the regex is evaluated, which is why I had to escape the space. That's not necessary when the regex is not defined in free-spacing mode.
There is no need to match a space after the dollar sign, so \ ? should be removed. Suppose now we have
r = /\$\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> ["$2.31", "$44.", "$33.607"]
That works, but it is questionable whether you want to match values that do not have exactly two digits after the decimal point.
Now write
r = /\$(\d+,)*\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> [[nil], [nil], [nil]]
To see why this result was obtained examine the doc for String#scan, specifically the last sentence of the first paragraph: " If the pattern contains groups, each individual result is itself an array containing one entry per group.".
We can avoid that problem by changing the capture group to a non-capture group:
r = /\$(?:\d+,)*\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> ["$2.31", "$44.", "$33.607"]
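If the capture groups have to stay in the pattern (for instance because the same regex is reused elsewhere), another option is the block form of String#scan combined with Regexp.last_match, which exposes the whole match regardless of groups. A minimal sketch, using a shortened version of the question's string; the variable names are only illustrative:

```ruby
# Shortened sample from the question (single quotes: no interpolation).
s = 'Up to $250,000 $3.90 Over $250,000 to $500,000'

# The asker's pattern, with its capture groups left in place.
with_groups = /\$(\d+,)*\d+(\.\d*)?/

# In block form, scan yields the groups to the block, but
# Regexp.last_match still holds the MatchData of the current match.
amounts = []
s.scan(with_groups) { amounts << Regexp.last_match[0] }

p amounts
```

Switching to non-capturing groups is still the simpler fix when the groups are not needed for anything else.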
Now consider this:
"$2,241.31 cat $1,2345. dog $33.607".scan r
#=> ["$2,241.31", "$1,2345.", "$33.607"]
which is still not quite right. Try the following.
r = /
\$ # match a dollar sign
\d{1,3} # match one to three digits
(?:,\d{3}) # match ',' then 3 digits in a nc group
* # execute the above nc group >=0 times
(?:\.\d{2}) # match '.' then 2 digits in a nc group
? # optionally match the above nc group
(?![\d,.]) # no following digit, ',' or '.'
/x # free-spacing regex definition mode
"$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27".scan r
#=> ["$2,241.31", "$2", "$1,234", "$146.27"]
(?![\d,.]) is a negative lookahead.
In normal mode this regular expression is written as follows.
r = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?![\d,.])/
The following erroneous result would obtain without the negative lookahead at the end of the regex.
r = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?/
"$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27".scan r
#=> ["$2,241.31", "$2", "$1,234", "$3,615", "$33.60",
# "$146.27"]
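One caveat worth checking against the question's own string: there, amounts are sometimes followed by a comma ("$500,000, add"), and the lookahead (?![\d,.]) rejects those. The sketch below compares the regex above with a hypothetical relaxed lookahead that only rejects a digit, or a ',' or '.' that is itself followed by a digit. The relaxed variant is an assumption about what is wanted, not part of the answer above.

```ruby
strict  = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?![\d,.])/
# Hypothetical variant: allow trailing punctuation, but still reject
# a following digit or a separator that continues the number.
relaxed = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?!\d|[,.]\d)/

prose = 'Over $250,000 to $500,000, add $3.70'
strict_hits  = prose.scan(strict)   # "$500,000" is skipped: a comma follows it
relaxed_hits = prose.scan(relaxed)

# The malformed amounts from the answer's test string are still rejected.
checks = '$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27'
relaxed_checks = checks.scan(relaxed)

p strict_hits, relaxed_hits, relaxed_checks
```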

[3] pry(main)> str = <<EOF
[3] pry(main)* Up to $250,000………………………………… $3.90 Over $250,000 to $500,000, add………………$3.70 Over $500,000 to $1,000,000, add……………..$3.40 Over $1,000,000 to $2,000,000, add……...........$2.25
[3] pry(main)* Over $2,000,000 add …..………………………$2.00
[3] pry(main)* EOF
=> "Up to $250,000………………………………… $3.90 Over $250,000 to $500,000, add………………$3.70 Over $500,000 to $1,000,000, add……………..$3.40 Over $1,000,000 to $2,000,000, add……...........$2.25\nOver $2,000,000 add …..………………………$2.00\n"
[4] pry(main)> str.scan /\$\d+(?:[,.]\d+)*/
=> ["$250,000", "$3.90", "$250,000", "$500,000", "$3.70", "$500,000", "$1,000,000", "$3.40", "$1,000,000", "$2,000,000", "$2.25", "$2,000,000", "$2.00"]

Related

how to get numbers from array of strings?

I have this array of strings.
["Anyvalue", "Total", "value:", "9,999.00", "Token", " ", "|", " ", "Total", "chain", "value:", "4,948"]
and I'm trying to extract the numbers in one line of code. I tried many methods, but none gave the result I expected.
I'm using one with grep method:
array.grep(/\d+/, &:to_i) #[9, 4]
but it returns only the leading integer of each match. It seems I have to add something to the pattern, but I don't know what.
Or is there another way to grab these numbers into an array?
you can use:
array.grep(/[\d,]+\.?\d+/)
if you want int:
array.grep(/[\d,]+\.?\d+/).map {_1.gsub(/[^0-9\.]/, '').to_i}
and a faster way (about 5X to 10X):
array.grep(/[\d,]+\.?\d+/).map { _1.delete("^0-9.").to_i }
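As a quick check that the two conversions above agree, here is a small sketch on the question's array. It uses explicit block parameters instead of _1 so that it also runs on Rubies older than 2.7. Note that to_i truncates the fractional part.

```ruby
array = ["Anyvalue", "Total", "value:", "9,999.00", "Token", " ", "|", " ",
         "Total", "chain", "value:", "4,948"]

# Keep only the strings that look like numbers.
numeric = array.grep(/[\d,]+\.?\d+/)

# Regex-based cleanup versus String#delete with a negated character set.
via_gsub   = numeric.map { |s| s.gsub(/[^0-9.]/, '').to_i }
via_delete = numeric.map { |s| s.delete('^0-9.').to_i }

p numeric, via_gsub, via_delete
```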
for a data like:
data = %w[
,,,4
1
1.2.3.4
-2
1,2,3
9,999.00
4,948
22,956
22,536,129,336
123,456
12.]
use:
data.grep(/^-?\d{1,3}(,\d{3})*(\.?\d+)?$/)
output:
["1", "-2", "9,999.00", "4,948", "22,956", "22,536,129,336", "123,456"]
arr = ["Anyvalue", "Total", "value:", "9,999.00", "Token", " ", "61.4.5",
"|", "chain", "-4,948", "3,25.61", "1,234,567.899"]
rgx = /\A\-?\d{1,3}(?:,\d{3})*(?:\.\d+)?\z/
arr.grep(rgx)
#=> ["9,999.00", "-4,948", "1,234,567.899"]
Regex demo. At the link the regular expression was evaluated with the PCRE regex engine but the results are the same when Ruby's Onigmo engine is used. Also, at the link I've used the anchors ^ and $ (beginning and end of line) instead of \A and \z (beginning and end of string) in order to test the regex against multiple strings.
The regular expression can be broken down as follows.
/
\A # match the beginning of the string
\-? # optionally match '-'
\d{1,3} # match between 1 and 3 digits inclusively
(?: # begin a non-capture group
,\d{3} # match a comma followed by 3 digits
)* # end the non-capture group and execute 0 or more times
(?: # begin a non-capture group
\.\d+ # match a period followed by one or more digits
)? # end the non-capture and make it optional
\z # match the end of the string
/
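Since the question asks for numbers rather than strings, the matched strings can be converted once they have passed the regex. A minimal sketch, assuming commas are thousands separators and that Float values are acceptable (BigDecimal may be preferable for money):

```ruby
arr = ["Anyvalue", "Total", "value:", "9,999.00", "Token", " ", "61.4.5",
       "|", "chain", "-4,948", "3,25.61", "1,234,567.899"]
rgx = /\A-?\d{1,3}(?:,\d{3})*(?:\.\d+)?\z/

# Strip the thousands separators, then convert strictly with Kernel#Float.
numbers = arr.grep(rgx).map { |s| Float(s.delete(',')) }

p numbers
```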
To make the test more robust we could use the methods Kernel::Float, Kernel::Rational and Kernel::Complex, all with the optional argument :exception set to false.
arr = ["Total", "9,999.00", " ", "61.4.5", "23e4", "-234.7e-2", "1+2i",
"3/4", "|", "chain", "-4,948", "3,25.61", "1,234,567.899", "10"]
arr.select { |s| s.match?(rgx) || Float(s, exception: false) ||
Rational(s, exception: false) || Complex(s, exception: false) }
#=> ["9,999.00", "23e4", "-234.7e-2", "1+2i", "3/4", "-4,948",
# "1,234,567.899", "10"]
Note that "23e4", "-234.7e-2", "1+2i" and "3/4" are respectively the string representations of an integer, float, complex and rational number.

How to match different groups in regex

I have the following string:
"Josua de Grave* (1643-1712)"
Everything before the * is the person's name, the first date 1643 is his birth date, 1712 is the date of his death.
Following this logic I'd like to have 3 match groups for each one of the item. I tried
([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})
"Josua de Grave* (1643-1712)".match(/([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})/)
but that returns nil.
Why is my logic wrong, and what should I do to get the 3 intended match groups.
The literal parentheses ( ) around the dates 1643-1712 need to be matched by your regex pattern, so use
([a-zA-Z\s]*)\* \((\d{3,4})-(\d{3,4})\)
//              ^^                   ^^
Since parentheses denote a capture group, escape them with \ to match them as literal characters.
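To illustrate the corrected pattern in Ruby, here is a small sketch that uses named groups so each part can be read by name. The group names and the person hash are illustrative choices, not part of the answer above.

```ruby
pattern = /(?<name>[a-zA-Z\s]*)\* \((?<birth>\d{3,4})-(?<death>\d{3,4})\)/

m = pattern.match('Josua de Grave* (1643-1712)')

# MatchData supports lookup by group name.
person = { name: m[:name], birth: m[:birth].to_i, death: m[:death].to_i }

p person
```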
While you can use a pattern, the problem of splitting this into its parts can also be easily done using other Ruby methods:
Using split:
s = "Josua de Grave* (1643-1712)"
name, dates = s.split('*') # => ["Josua de Grave", " (1643-1712)"]
birth, death = dates[2..-2].split('-') # => ["1643", "1712"]
Or, using scan:
*name, birth, death = s.scan(/[[:alnum:]]+/) # => ["Josua", "de", "Grave", "1643", "1712"]
name.join(' ') # => "Josua de Grave"
birth # => "1643"
death # => "1712"
If I was using a pattern, I'd use this:
name, birth, death = /^([^*]+).+?(\d+)-(\d+)/.match(s)[1..3] # => ["Josua de Grave", "1643", "1712"]
name # => "Josua de Grave"
birth # => "1643"
death # => "1712"
/^([^*]+).+?(\d+)-(\d+)/ means:
^ start at the beginning of the buffer
([^*]+) capture everything up to, but not including, the first *
.+? skip the minimum until...
(\d+) the year is matched and captured
- match but don't capture
(\d+) the year is matched and captured
Regexper helps explain it as does Rubular.
r = /\*\s+\(|(?<=\d)\s*-\s*|\)/
"Josua de Grave* (1643-1712)".split r
#=> ["Josua de Grave", "1643", "1712"]
"Sir Winston Leonard Spencer-Churchill* (1874 - 1965)".split r
#=> ["Sir Winston Leonard Spencer-Churchill", "1874", "1965"]
The regular expression can be made self-documenting by writing it in free-spacing mode:
r = /
\*\s+\( # match '*' then >= 1 whitespaces then '('
| # or
(?<=\d) # match is preceded by a digit (positive lookbehind)
\s*-\s* # match >= 0 whitespaces then '-' then >= 0 whitespaces
| # or
\) # match ')'
/x # free-spacing regex definition mode
The positive lookbehind is needed to avoid splitting hyphenated names on hyphens. (The positive lookahead (?=\d), placed after \s*-\s*, could be used instead.)
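The lookahead alternative mentioned in the parenthetical can be sketched as follows; it should split the two sample strings in the same way:

```ruby
# Same idea, but require a digit AFTER the hyphen instead of before it.
r_lookahead = /\*\s+\(|\s*-\s*(?=\d)|\)/

first  = 'Josua de Grave* (1643-1712)'.split(r_lookahead)
second = 'Sir Winston Leonard Spencer-Churchill* (1874 - 1965)'.split(r_lookahead)

p first, second
```

The hyphen in "Spencer-Churchill" is followed by a letter, so the lookahead fails there and the name stays intact.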

Regex for receipt items

I have a simple receipt which I typed out. I need to be able to read the items purchased on the receipt. The sample receipt is below.
Tim Hortons
Alwasy Fresh
1 Brek Wrap Combo /A ($0.76)
1 Bacon-wrap $3.79
1 Grilled $0.00
1 5 Pieces Bacon-wrap $0.00
1 Orange $1.40
1 Deposit $0.10
Subtotal: $55.84
GST: $0.29
Debit: $55.84
Take out
Thanks for stopping by!!
Tell us how we did
I came up with the following regex string to find the items.
\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}
It works for the most part but there are a few incorrect lines like
4
GST: $0.29
Can someone come up with a better pattern. Below is a link to see it in action.
http://regexr.com/3cnk9
I see a number of problems with this original regex:
\d(\s){1,10}(.)*\s{1,}\$\d\.[0-9]{2}
First, parentheses both group and capture, and when you quantify a group, only the last iteration is captured, so (.)* will store only the last character; you wanted (.*) for that. Since it's greedy, that will be the character before the space preceding a dollar sign, which given your data will always be a space. Similarly, you're quantifying a group at the beginning with (\s){1,10}, which captures only the last whitespace character. In this case, you don't need the group since \s matches a single whitespace character, so you can simply use \s{1,10}.
Here is a piece-by-piece explanation of what that regular expression does.
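The capture behaviour of a quantified group is easy to confirm on a tiny input. A minimal sketch in Ruby (the same holds in Perl and PCRE); the sample strings are made up for illustration:

```ruby
# A quantified group keeps only its last iteration...
last_only = 'abc'.match(/(.)*/)
# ...whereas quantifying inside the group captures the whole run.
whole_run = 'abc'.match(/(.*)/)

# The same applies to (\s){1,10}: only the final whitespace character is kept.
last_space = "1 \tx".match(/\d(\s){1,10}/)

p last_only[1], whole_run[1], last_space[1]
```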
Capturing solution
The following regex captures the quantity ($1), item description ($2), whether the price is parenthesized ($3), and the price ($4):
^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$
Explained and matched to your sample at regex101.
Separated out and commented (assumes the /x flag is supported):
/ # begin regex
^\s* # start of line, ignore leading spaces if present
(\d+) # $1 = quantity
\s+ # spacing as a delimiter
(.*\S) # $2 = item: contains anything, must end in a non-space char
\s+ # spacing as a delimiter
(\(?) # $3 = negation, an optional open parenthesis
\$ # dollar sign
([0-9.]+) # $4 = price
\)?\s*$ # trailing characters: optional end-paren and space(s)
/x # end regex; the x flag permits whitespace and comments in the pattern
with sample perl code executed from a command line:
perl -ne '
my ($quantity, $item, $neg, $price)
= /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/;
if ($item) {
if ($neg) { $price *= -1; }
print "<$quantity><$item><$price>\n"
}' RECEIPT_FILE
(If you want that as a perl script, wrap the code with while(<>) { } and you're done.)
This assigns the variables $quantity, $item, and $price to the itemized lines on your receipt. I am assuming that a parenthesized item is to be subtracted (but I can't verify that since the totals are nonsensical), so $neg notes the existence of a parenthesis so the $price can be negated.
I set the output to use angle brackets (< and >) to indicate what each variable stores.
The output of your given sample receipt would therefore be:
<1><Brek Wrap Combo /A><-0.76>
<1><Bacon-wrap><3.79>
<1><Grilled><0.00>
<1><5 Pieces Bacon-wrap><0.00>
<1><Orange><1.40>
<1><Deposit><0.10>
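For readers working in Ruby rather than Perl, the same regex can be applied line by line. This is only a sketch of a port: it keeps the assumption that a parenthesized price is negative, and the constant and hash key names are illustrative.

```ruby
# Sample receipt from the question; the quoted heredoc tag disables interpolation.
RECEIPT = <<~'TEXT'
  Tim Hortons
  Alwasy Fresh
  1 Brek Wrap Combo /A ($0.76)
  1 Bacon-wrap $3.79
  1 Grilled $0.00
  1 5 Pieces Bacon-wrap $0.00
  1 Orange $1.40
  1 Deposit $0.10
  Subtotal: $55.84
  GST: $0.29
  Debit: $55.84
  Take out
TEXT

LINE_ITEM = /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/

items = []
RECEIPT.each_line do |line|
  m = LINE_ITEM.match(line.chomp)
  next unless m                       # skip headers, totals and footers
  price = m[4].to_f
  price = -price unless m[3].empty?   # an open parenthesis marks a negative price
  items << { quantity: m[1].to_i, item: m[2], price: price }
end

items.each { |i| puts "<#{i[:quantity]}><#{i[:item]}><#{i[:price]}>" }
```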
Prices only solution
You didn't say what you wanted to match. If you don't care about anything but the prices and there are no negative values, you don't need capture groups if your engine supports look-behind or \K:
grep -Po '^\s*[0-9].*\$\K[0-9.]+' RECEIPT_FILE
Grep's -P flag invokes libpcre (which may not be available if you're on an old or embedded system) and -o displays only the matching text. \K denotes the start of the match. Put the \$ after the \K if you want to capture it. (See also the regex101 description and matches.)
Output from that grep command:
0.76
3.79
0.00
0.00
1.40
0.10
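Ruby's regex engine also supports \K, so the same pattern can be used with String#scan. A minimal sketch on a shortened copy of the receipt (the variable names are illustrative):

```ruby
receipt = <<~'TEXT'
  Tim Hortons
  1 Brek Wrap Combo /A ($0.76)
  1 Bacon-wrap $3.79
  1 5 Pieces Bacon-wrap $0.00
  Subtotal: $55.84
  GST: $0.29
TEXT

# In Ruby, ^ always anchors at the start of a line and . never crosses a
# newline, so no extra flags are needed. \K drops everything matched so far.
prices = receipt.scan(/^\s*[0-9].*\$\K[0-9.]+/)

p prices
```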
Prices only – with awk
There aren't great ways to handle this regex with efficiency. If you're processing through a mountain of content, you'll feel the hurt. Here's a solution using awk that should be significantly faster. (The difference won't be noticeable with a small input.)
awk '$1 / 1 > 0 && $NF ~ /\$/ { gsub(/[()]/, "", $NF); print $NF; }' RECEIPT_FILE
Commented version with explanation:
awk '
# if the quantity is indeed a number and the last field has a dollar sign
$1 / 1 > 0 && $NF ~ /\$/ {
gsub(/[()]/, "", $NF); # remove all parentheses from the last field
print $NF; # print the contents of the last field
}' RECEIPT_FILE
Prices only – with awk, supporting negative prices
awk '
# if the quantity is indeed a number and the last field has a dollar sign
$1 / 1 > 0 && $NF ~ /\$/ {
neg = 1;
if ( $NF ~ /\(/ ) { # the last field has an open parenthesis
gsub(/[()]/, "", $NF); # remove all parentheses from the last field
neg = -1;
}
print $NF * neg; # print the last field, negated if parenthesized
}' RECEIPT_FILE
Here's my attempt:
^(\d+)\s+(.*)\s+\(?(\$.+)\)?$
Stub. Remember to turn the multiline option on. Components:
^ - beginning of line
(\d+) - capture the quantity at the beginning of each line item
\s+ - one or more spaces
(.*) - capture the item description
\s+ - one or more spaces
\(? - optional open bracket `(` character
(\$.+) - capture anything including and after the dollar sign
\)? - optional close bracket `)` character
$ - end of line
You can use
^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)
See the regex demo
This regex should be used with the /m modifier to match data on different lines. In JS, the /g modifier is also required.
Explanation:
^ - start of a line
(\d+) - Group 1 capturing one or more digits
\s+ - one or more whitespaces
(.*?) - Group 2 capturing zero or more characters other than a newline, as few as possible, up to the closest
\s+ - one or more whitespaces
\(? - an optional ( (on the first line)
\$ - a literal $
(\d+\.\d+) - Group 3 capturing one or more digits followed with . and one or more digits.
JS demo:
var re = /^(\d+)\s+(.*?)\s+\(?\$(\d+\.\d+)/gm;
var str = ' Tim Hortons\n Alwasy Fresh\n\n1 Brek Wrap Combo /A ($0.76)\n1 Bacon-wrap $3.79\n1 Grilled $0.00\n1 5 Pieces Bacon-wrap $0.00\n1 Orange $1.40\n1 Deposit $0.10\nSubtotal: $55.84\nGST: $0.29\nDebit: $55.84\nTake out\n\n Thanks for stopping by!!\n Tell us how we did';
while ((m = re.exec(str)) !== null) {
document.body.innerHTML += "Pcs: <b>" + m[1] + "</b>, item: <b>" + m[2] + "</b>, paid: <b>" + m[3] + "</b><br/>";
}
Adam Katz's answer should be the accepted one! I used this variation of his answer for an implementation in JavaScript:
const receiptRegex = /^\s*(\d+)\s+(.*\S)\s+(\(?)\$([0-9.]+)\)?\s*$/gm
let items = [];
const matches = inputStr.matchAll(receiptRegex);
for (const matchedGroup of matches) {
const [
fullString, //[0] -> matched string "1 Blue gatorade $2.00"
quantity, //[1] -> quantity "1"
item, //[2] -> item description "Blue gatorade"
ignoredSymbol, //[3] -> "(" when the price is parenthesized, otherwise "" (not the "$")
price //[4] -> amount "2.00"
] = matchedGroup;
items.push({
quantity,
item,
price,
});
}

How to Capture Only Surnames from a Regex Pattern?

Team
I have written a Perl program to validate the accuracy of formatting (punctuation and the like) of surnames, forenames, and years.
If a particular entry doesn't follow a specified pattern, that entry is highlighted to be fixed.
For example, my input file has lines of similar text:
<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>
My program works just fine, that is, if any entry doesn't follow the pattern, the script generates an error. The above input text doesn't generate any error. But the one below is an example of an error because Rose A. J. is missing a comma after Rose:
NOT FOUND: <bibliomixed id="bkrmbib120">Asher, S. R., & Rose A. J. (1997). Promoting children’s social-emotional adjustment with peers. In P. Salovey & D. Sluyter, (Eds). <emphasis>Emotional development and emotional intelligence: Educational implications.</emphasis> New York: Basic Books.</bibliomixed>
From my regex search pattern, is it possible to capture all the surnames and the year, so I can generate a text prefixed to each line as shown below?
<BIB>Abdo, Afif-Abdo, Otani, Machado, 2008</BIB><bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction among patients with erectile dysfunction treated with counseling, sildenafil, or both. <emphasis>Journal of Sexual Medicine</emphasis>, <emphasis>5</emphasis>, 1720–1726.</bibliomixed>
My regex search script is as follows:
while(<$INPUT_REF_XML_FH>){
$line_count += 1;
chomp;
if(/
# bibliomixed XML ID tag and attribute----<START>
<bibliomixed
\s+
id=".*?">
# bibliomixed XML ID tag and attribute----<END>
# --------2 OR MORE AUTHOR GROUP--------<START>
(?:
(?:
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
,\s)+
#---------------FINAL AUTHOR GROUP SEPARATOR----<START>
&\s
#---------------FINAL AUTHOR GROUP SEPARATOR----<END>
# --------2 OR MORE AUTHOR GROUP--------<END>
)?
# --------LAST AUTHOR GROUP--------<START>
# pattern for surname----<START>
(?:(?:[\w\x{2019}|\x{0027}]+\s)+)? # surnames with spaces
(?:(?:[\w\x{2019}|\x{0027}]+-)+)? # surnames with hyphens
(?:[A-Z](?:\x{2019}|\x{0027}))? # surnames with closing single quote or apostrophe O’Leary
(?:St\.\s)? # pattern for St.
(?:\w+-\w+\s)?# pattern for McGillicuddy-De Lisi
(?:[\w\x{2019}|\x{0027}]+) # final surname pattern----REQUIRED
# pattern for surname----<END>
,\s
# pattern for forename----<START>
(?:
(?:(?:[A-Z]\.\s)+)? #initials with periods
(?:[A-Z]\.-)? #initials with hyphens and periods <<Y.-C. L.>>
(?:(?:[A-Z]\.\s)+)? #initials with periods
[A-Z]\. #----REQUIRED
# pattern for titles....<START>
(?:,\s(?:Jr\.|Sr\.|II|III|IV))?
# pattern for titles....<END>
)
# pattern for forename----<END>
(?: # pattern for editor notation----<START>
\s\(Ed(?:s)?\.\)\.
)? # pattern for editor notation----<END>
# --------LAST AUTHOR GROUP--------<END>
\s
\(
# pattern for a year----<START>
(?:[A-Za-z]+,\s)? # July, 1999
(?:[A-Za-z]+\s)? # July 1999
(?:[0-9]{4}\/)? # 1999\/2000
(?:\w+\s\d+,\s)?# August 18, 2003
(?:[0-9]{4}|in\spress|manuscript\sin\spreparation) # (1999) (in press) (manuscript in preparation)----REQUIRED
(?:[A-Za-z])? # 1999a
(?:,\s[A-Za-z]+\s[0-9]+)? # 1999, July 2
(?:,\s[A-Za-z]+\s[0-9]+\x{2013}[0-9]+)? # 2002, June 19–25
(?:,\s[A-Za-z]+)? # 1999, Spring
(?:,\s[A-Za-z]+\/[A-Za-z]+)? # 1999, Spring\/Winter
(?:,\s[A-Za-z]+-[A-Za-z]+)? # 2003, Mid-Winter
(?:,\s[A-Za-z]+\s[A-Za-z]+)? # 2007, Anniversary Issue
# pattern for a year----<END>
\)\.
/six){
print $FOUND_REPORT_FH "$line_count\tFOUND: $&\n";
$found_count += 1;
} else{
print $ERROR_REPORT_FH "$line_count\tNOT FOUND: $_\n";
$not_found_count += 1;
}
Thanks for your help,
Prem
Alter this bit
# pattern for surname----<END>
,?\s
This now means an optional , followed by whitespace. If the person's surname is "Bunga Bunga", it won't work.
All of your subpatterns are non-capturing groups, starting with (?:. That spares the engine the work of recording each submatch, but it also means that nothing is captured.
To capture part of a pattern you merely need to place parentheses around the part you want to capture. So you could remove the non-capturing marker ?: or place parentheses () where you need them. http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
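As a rough illustration of the capturing idea, here is a sketch in Ruby rather than Perl. Instead of adding captures to the full validation regex, it assumes the line has already passed validation, pulls out the author segment and a four-digit year, and drops the initials and titles. The names ENTRY, INITIALS, TITLE and bib_prefix are hypothetical, and the simplified patterns do not cover every year format the original regex accepts.

```ruby
# Simplified entry pattern: author segment, then "(YYYY)." with an optional letter.
ENTRY    = /<bibliomixed\s+id="[^"]*">(?<authors>.*?)\s\((?<year>[0-9]{4})[A-Za-z]?\)\./
INITIALS = /\A(?:[A-Z]\.[\s-]?)+\z/          # "C." or "S. R." or "Y.-C. L."
TITLE    = /\A(?:Jr\.|Sr\.|II|III|IV)\z/

# Returns the <BIB>...</BIB> prefix, or nil if the line does not look like an entry.
def bib_prefix(line)
  m = ENTRY.match(line)
  return nil unless m

  # Split "Abdo, C., Afif-Abdo, J., & Machado, A." on ", " and ", & ".
  parts = m[:authors].split(/,\s+(?:&\s+)?/)
  surnames = parts.reject { |part| part.match?(INITIALS) || part.match?(TITLE) }
  "<BIB>#{(surnames + [m[:year]]).join(', ')}</BIB>"
end

line = '<bibliomixed id="bkrmbib5">Abdo, C., Afif-Abdo, J., Otani, F., & Machado, A. (2008). Sexual satisfaction.</bibliomixed>'
puts "#{bib_prefix(line)}#{line}"
```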
I'm not sure but, from your code I think you may be attempting to use lookahead assertions as, for example, you test for surnames with spaces, if none then test for surnames with hyphens. This will not start from the same point every time, it will either match the first example or not, then move forward to test the next position with the second surname pattern, whether the regex will then test the second name for the first subpattern is what I am unsure of. http://perldoc.perl.org/perlretut.html#Looking-ahead-and-looking-behind
#!/usr/bin/perl
use warnings;
use strict;
my $line = '123 456 7antelope89';
$line =~ /^(\d+\s\d+\s)?(\d+\w+\d+)?/;
my ($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7bealzelope89';
$line =~ /(?:\d+\s\d+\s)?(?:\d+\w+\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
$line = '123 456 7canteloupe89';
$line =~ /((?:\d+\s\d+\s))?(?:\d+(\w+)\d+)?/;
($ay,$be) = ($1 ? $1:'nocapture ', $2 ? $2:'nocapture ');
print 'a: ',$ay,'b: ',$be,$/;
undef for ($ay,$be,$1,$2);
exit 0;
When the goal is to capture the whole group, the first pattern of the third example, ((?:\d+\s\d+\s))?, is redundant: it wraps a non-capturing group in a capturing one, so plain capturing parentheses would do. The construct is useful in the second pattern, (?:\d+(\w+)\d+)?, which is a fine-grained capture: the captured part sits inside a non-capturing group.
a: 123 456 b: 7antelope89
a: nocapture b: nocapture
a: 123 456 b: canteloupe8
One little nitpick:
id=".*?"
may be better as
id="\w*?"
since id names must be alphanumeric (plus underscore), iirc.

Regular expression to find unescaped double quotes in CSV file

What would a regular expression be to find sets of 2 unescaped double quotes that are contained in columns set off by double quotes in a CSV file?
Not a match:
"asdf","asdf"
"", "asdf"
"asdf", ""
"adsf", "", "asdf"
Match:
"asdf""asdf", "asdf"
"asdf", """asdf"""
"asdf", """"
Try this:
(?m)""(?![ \t]*(,|$))
Explanation:
(?m) // enable multi-line matching (^ will act as the start of the line and $ will act as the end of the line (i))
"" // match two successive double quotes
(?! // start negative look ahead
[ \t]* // zero or more spaces or tabs
( // open group 1
, // match a comma
| // OR
$ // the end of the line or string
) // close group 1
) // stop negative look ahead
So, in plain English: "match two successive double quotes, only if they DON'T have a comma or end-of-the-line ahead of them with optionally spaces and tabs in between".
(i) besides being the normal start-of-the-string and end-of-the-string meta characters.
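The pattern can be tried against the question's samples in Ruby. In this sketch the (?m) prefix is omitted, because in Ruby ^ and $ already match at line boundaries, while (?m) instead makes . match newlines. The capture group is also made non-capturing, which does not change what matches.

```ruby
UNESCAPED_PAIR = /""(?![ \t]*(?:,|$))/

should_not_match = ['"asdf","asdf"', '"", "asdf"', '"asdf", ""', '"adsf", "", "asdf"']
should_match     = ['"asdf""asdf", "asdf"', '"asdf", """asdf"""', '"asdf", """"']

(should_not_match + should_match).each do |row|
  puts "#{row}: #{row.match?(UNESCAPED_PAIR)}"
end
```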
Due to the complexity of your problem, the solution depends on the engine you are using. That is because solving it requires look-behind and look-ahead, and engines differ in how they support these.
My answer uses the Ruby engine. The check itself is a single regex, but I put the whole code here to explain it better.
NOTE that, due to the Ruby regex engine (or my knowledge of it), an optional look-ahead/look-behind is not possible. So I first need to deal with the small problem of spaces before and after commas.
Here is my code:
orgTexts = [
'"asdf","asdf"',
'"", "asdf"',
'"asdf", ""',
'"adsf", "", "asdf"',
'"asdf""asdf", "asdf"',
'"asdf", """asdf"""',
'"asdf", """"'
]
orgTexts.each{|orgText|
# Preprocessing - Eliminate spaces before and after comma
# Here is needed if you may have spaces before and after a valid comma
orgText = orgText.gsub(Regexp.new('\" *, *\"'), '","')
# Detect valid character (non-quote and valid quote)
resText = orgText.gsub(Regexp.new('([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\\\)\")'), '-')
# resText = orgText.gsub(Regexp.new('([^\"]|(^|(?<=,)|(?<=\\\\))\"|\"($|(?=,)))'), '-')
# [^\"] ===> a non-quote character
# | ===> or
# ^\" ===> opening quote at the start of the line
# | ===> or
# \"$ ===> closing quote at the end of the line
# | ===> or
# (?<=,)\" ===> quote just after a comma
# | ===> or
# \"(?=,) ===> quote just before a comma
# | ===> or
# (?<=\\\\)\" ===> escaped quote (preceded by a backslash)
# This part shows the invalid, unescaped quotes
puts orgText
puts resText.gsub(Regexp.new('"'), '^')
# This part determines whether there are unescaped quotes
# Here is the actual matching; use this one if you don't need to know which quote is unescaped
# (in a regex literal a single backslash is written \\, unlike in the quoted string above)
isMatch = ((orgText =~ /^([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\)\")*$/) != 0).to_s
# Basically, it matches from start to end (^...$) only if every character is valid
puts orgText + ": " + isMatch
print
print ""
print ""
}
When executed the code prints:
"asdf","asdf"
-------------
"asdf","asdf": false
"","asdf"
---------
"","asdf": false
"asdf",""
---------
"asdf","": false
"adsf","","asdf"
----------------
"adsf","","asdf": false
"asdf""asdf","asdf"
-----^^------------
"asdf""asdf","asdf": true
"asdf","""asdf"""
--------^^----^^-
"asdf","""asdf""": true
"asdf",""""
--------^^-
"asdf","""": true
I hope I give you some idea here that you can use with other engine and language.
".*"(\n|(".*",)*)
should work, I guess...
For single-line matches:
^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"
or for multi-line:
(^|\r\n)("[^\r\n"]*"\s*,\s*)*"[^\r\n"]*""[^\r\n"]*"
Edit/Note: Depending on the regex engine used, you could use lookbehinds and other stuff to make the regex leaner. But this should work in most regex engines just fine.
Try this regular expression:
"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"
That will match any quoted string with at least one pair of unescaped double quotes.