how to get numbers from array of strings? - regex

I have this array of strings.
["Anyvalue", "Total", "value:", "9,999.00", "Token", " ", "|", " ", "Total", "chain", "value:", "4,948"]
and I'm trying to get the numbers in one line of code. I tried many methods, but none of them worked the way I expected.
I'm using one with grep method:
array.grep(/\d+/, &:to_i) #[9, 4]
but it returns only the leading integer of each string. It seems I have to add something to the pattern, but I don't know what.
Or is there another way to grab these numbers into an Array?

You can use:
array.grep(/[\d,]+\.?\d+/)
If you want integers:
array.grep(/[\d,]+\.?\d+/).map { _1.gsub(/[^0-9.]/, '').to_i }
A faster way (about 5X to 10X):
array.grep(/[\d,]+\.?\d+/).map { _1.delete("^0-9.").to_i }
For data like:
data = %w[
,,,4
1
1.2.3.4
-2
1,2,3
9,999.00
4,948
22,956
22,536,129,336
123,456
12.]
use:
data.grep(/^-?\d{1,3}(,\d{3})*(\.?\d+)?$/)
output:
["1", "-2", "9,999.00", "4,948", "22,956", "22,536,129,336", "123,456"]
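As a quick check, the sample data and the pattern above can be run together. The sketch below also converts the matches to numbers. The names `pattern`, `matches` and `values` are illustrative, and stripping only the commas (rather than everything outside `0-9.`) is an assumption made here so that the minus sign in "-2" survives the conversion.

```ruby
data = %w[,,,4 1 1.2.3.4 -2 1,2,3 9,999.00 4,948 22,956 22,536,129,336 123,456 12.]
pattern = /^-?\d{1,3}(,\d{3})*(\.?\d+)?$/

# Keep only the strings with well-formed thousands groups.
matches = data.grep(pattern)

# Drop the thousands separators, then convert; to_f keeps sign and decimals.
values = matches.map { |s| s.delete(",").to_f }

p matches
p values
```

If whole numbers are wanted instead, `to_i` can replace `to_f`, at the cost of truncating any decimal part.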

arr = ["Anyvalue", "Total", "value:", "9,999.00", "Token", " ", "61.4.5",
"|", "chain", "-4,948", "3,25.61", "1,234,567.899"]
rgx = /\A\-?\d{1,3}(?:,\d{3})*(?:\.\d+)?\z/
arr.grep(rgx)
#=> ["9,999.00", "-4,948", "1,234,567.899"]
Regex demo. At the link the regular expression was evaluated with the PCRE regex engine, but the results are the same when Ruby's Onigmo engine is used. Also, at the link I've used the anchors ^ and $ (beginning and end of line) instead of \A and \z (beginning and end of string) in order to test the regex against multiple strings.
The regular expression can be broken down as follows.
/
\A # match the beginning of the string
\-? # optionally match '-'
\d{1,3} # match between 1 and 3 digits inclusively
(?: # begin a non-capture group
,\d{3} # match a comma followed by 3 digits
)* # end the non-capture group and execute 0 or more times
(?: # begin a non-capture group
\.\d+ # match a period followed by one or more digits
)? # end the non-capture and make it optional
\z # match the end of the string
/
To make the test more robust we could use the methods Kernel::Float, Kernel::Rational and Kernel::Complex, all with the optional argument :exception set to false.
arr = ["Total", "9,999.00", " ", "61.4.5", "23e4", "-234.7e-2", "1+2i",
"3/4", "|", "chain", "-4,948", "3,25.61", "1,234,567.899", "10"]
arr.select { |s| s.match?(rgx) || Float(s, exception: false) ||
Rational(s, exception: false) || Complex(s, exception: false) }
#=> ["9,999.00", "23e4", "-234.7e-2", "1+2i", "3/4", "-4,948",
# "1,234,567.899", "10"]
Note that "23e4", "-234.7e-2", "1+2i" and "3/4" are respectively the string representations of an integer, float, complex and rational number.
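The same checks can also return the parsed value rather than just filtering. This is only a sketch: `to_number` is a hypothetical helper name, and removing commas before conversion is an assumption about how comma-grouped strings should be interpreted.

```ruby
RGX = /\A-?\d{1,3}(?:,\d{3})*(?:\.\d+)?\z/

# Hypothetical helper: returns a numeric value, or nil if s is not numeric.
def to_number(s)
  if s.match?(RGX)
    plain = s.delete(",")
    # Base 10 is passed explicitly so a leading zero is not read as octal.
    return Integer(plain, 10, exception: false) || Float(plain)
  end
  # With exception: false each conversion returns nil instead of raising.
  Float(s, exception: false) ||
    Rational(s, exception: false) ||
    Complex(s, exception: false)
end

arr = ["Total", "9,999.00", "61.4.5", "23e4", "3/4", "1+2i", "-4,948", "10"]
arr.each { |s| puts "#{s.inspect} -> #{to_number(s).inspect}" }
```

Strings that none of the conversions accept, such as "Total" or "61.4.5", come back as nil.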

Related

How to match different groups in regex

I have the following string:
"Josua de Grave* (1643-1712)"
Everything before the * is the person's name, the first date, 1643, is his birth date, and 1712 is the date of his death.
Following this logic I'd like to have 3 match groups, one for each item. I tried
([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})
"Josua de Grave* (1643-1712)".match(/([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})/)
but that returns nil.
Why is my logic wrong, and what should I do to get the 3 intended match groups?
The literal parentheses ( ) around the dates 1643-1712 need to be accounted for in your regex pattern, so use
([a-zA-Z\s]*)\* \((\d{3,4})-(\d{3,4})\)
//              ^^                   ^^
Since parentheses denote a capture group, escape them with \ to match them as literal characters.
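A minimal sketch of the fix in Ruby, run against the string from the question. The named groups `name`, `birth` and `death` are an illustrative choice and not part of the original answer; numbered groups work the same way.

```ruby
s = "Josua de Grave* (1643-1712)"

# The original pattern fails: after "* " the string has "(", not a digit.
original = /([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})/
p original.match(s)

# Escaped parentheses match the literal "(" and ")" around the dates.
fixed = /(?<name>[a-zA-Z\s]*)\* \((?<birth>\d{3,4})-(?<death>\d{3,4})\)/
m = fixed.match(s)
p m[:name], m[:birth], m[:death]
```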
While you can use a pattern, the problem of splitting this into its parts can also be easily done using other Ruby methods:
Using split:
s = "Josua de Grave* (1643-1712)"
name, dates = s.split('*') # => ["Josua de Grave", " (1643-1712)"]
birth, death = dates[2..-2].split('-') # => ["1643", "1712"]
Or, using scan:
*name, birth, death = s.scan(/[[:alnum:]]+/) # => ["Josua", "de", "Grave", "1643", "1712"]
name.join(' ') # => "Josua de Grave"
birth # => "1643"
death # => "1712"
If I was using a pattern, I'd use this:
name, birth, death = /^([^*]+).+?(\d+)-(\d+)/.match(s)[1..3] # => ["Josua de Grave", "1643", "1712"]
name # => "Josua de Grave"
birth # => "1643"
death # => "1712"
/^([^*]+).+?(\d+)-(\d+)/ means:
^ start at the beginning of the line (here, the whole string)
([^*]+) capture everything that is not *, stopping at the first *
.+? skip the minimum until...
(\d+) the birth year is matched and captured
- match but don't capture
(\d+) the death year is matched and captured
Regexper helps explain it as does Rubular.
r = /\*\s+\(|(?<=\d)\s*-\s*|\)/
"Josua de Grave* (1643-1712)".split r
#=> ["Josua de Grave", "1643", "1712"]
"Sir Winston Leonard Spencer-Churchill* (1874 - 1965)".split r
#=> ["Sir Winston Leonard Spencer-Churchill", "1874", "1965"]
The regular expression can be made self-documenting by writing it in free-spacing mode:
r = /
\*\s+\( # match '*' then >= 1 whitespaces then '('
| # or
(?<=\d) # match is preceded by a digit (positive lookbehind)
\s*-\s* # match >= 0 whitespaces then '-' then >= 0 whitespaces
| # or
\) # match ')'
/x # free-spacing regex definition mode
The positive lookbehind is needed to avoid splitting hyphenated names on hyphens. (The positive lookahead (?=\d), placed after \s*-\s*, could be used instead.)
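The lookahead variant mentioned in the parenthesis can be sketched as follows. The name `r_ahead` is illustrative; the two test strings are the ones used above.

```ruby
# Same split, but the hyphen must be followed by a digit instead of preceded by one.
r_ahead = /\*\s+\(|\s*-\s*(?=\d)|\)/

p "Josua de Grave* (1643-1712)".split(r_ahead)
p "Sir Winston Leonard Spencer-Churchill* (1874 - 1965)".split(r_ahead)
```

The hyphen in "Spencer-Churchill" is followed by a letter, so it is left alone in both versions.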

Regex scan fails

I am trying to parse all money from a string. For example, I want to extract:
['$250,000', '$3.90', '$250,000', '$500,000']
from:
'Up to $250,000………………………………… $3.90 Over $250,000 to $500,000'
The regex:
\$\ ?(\d+\,)*\d+(\.\d*)?
seems to match all money expressions as in this link. However, when I try to scan on Ruby, it fails to give me the desired result.
s # => "Up to $250,000 $3.90 Over $250,000 to $500,000, add$3.70 Over $500,000 to $1,000,000, add..$3.40 Over $1,000,000 to $2,000,000, add...........$2.25\nOver $2,000,000 add ..$2.00"
r # => /\$\ ?(\d+\,)*\d+\.?\d*/
s.scan(r)
# => [["250,"], [nil], ["250,"], ["500,"], [nil], ["500,"], ["000,"], [nil], ["000,"], ["000,"], [nil], ["000,"], [nil]]
From String#scan docs, it looks like this is because of the group. How can I parse all the money in the string?
Let's look at your regular expression, which I'll write in free-spacing mode so I can document it:
r = /
\$ # match a dollar sign
\ ? # optionally match a space (has no effect)
( # begin capture group 1
\d+ # match one or more digits
, # match a comma (need not be escaped)
)* # end capture group 1 and execute it >= 0 times
\d+ # match one or more digits
\.? # optionally match a period
\d* # match zero or more digits
/x # free-spacing regex definition mode
In non-free-spacing mode this would be written as follows.
r = /\$ ?(\d+,)*\d+\.?\d*/
When a regex is defined in free-spacing mode all spaces are stripped out before the regex is evaluated, which is why I had to escape the space. That's not necessary when the regex is not defined in free-spacing mode.
Nowhere in the example does a space follow the dollar sign, so \ ? can be removed. Suppose now we have
r = /\$\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> ["$2.31", "$44.", "$33.607"]
That works, but it is questionable whether you want to match values that do not have exactly two digits after the decimal point.
Now write
r = /\$(\d+,)*\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> [[nil], [nil], [nil]]
To see why this result was obtained examine the doc for String#scan, specifically the last sentence of the first paragraph: " If the pattern contains groups, each individual result is itself an array containing one entry per group.".
We can avoid that problem by changing the capture group to a non-capture group:
r = /\$(?:\d+,)*\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> ["$2.31", "$44.", "$33.607"]
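If the capture group has to stay for some other reason, another option is to pass a block to scan and read the whole match from Regexp.last_match. This is a sketch of that alternative, not something the original answer proposes; the names `text` and `whole` are illustrative.

```ruby
r = /\$(\d+,)*\d+\.?\d*/ # still contains a capture group
text = "$2.31 cat $44. dog $33.607"

whole = []
# Inside the block, Regexp.last_match(0) is the full text of the current match.
text.scan(r) { whole << Regexp.last_match(0) }
p whole
```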
Now consider this:
"$2,241.31 cat $1,2345. dog $33.607".scan r
#=> ["$2,241.31", "$1,2345.", "$33.607"]
which is still not quite right. Try the following.
r = /
\$ # match a dollar sign
\d{1,3} # match one to three digits
(?:,\d{3}) # match ',' then 3 digits in a nc group
* # execute the above nc group >=0 times
(?:\.\d{2}) # match '.' then 2 digits in a nc group
? # optionally match the above nc group
(?![\d,.]) # no following digit, ',' or '.'
/x # free-spacing regex definition mode
"$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27".scan r
#=> ["$2,241.31", "$2", "$1,234", "$146.27"]
(?![\d,.]) is a negative lookahead.
In normal mode this regular expression is written as follows.
r = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?![\d,.])/
The following erroneous result would be obtained without the negative lookahead at the end of the regex.
r = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?/
"$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27".scan r
#=> ["$2,241.31", "$2", "$1,234", "$3,615", "$33.60",
# "$146.27"]
[3] pry(main)> str = <<EOF
[3] pry(main)* Up to $250,000………………………………… $3.90 Over $250,000 to $500,000, add………………$3.70 Over $500,000 to $1,000,000, add……………..$3.40 Over $1,000,000 to $2,000,000, add……...........$2.25
[3] pry(main)* Over $2,000,000 add …..………………………$2.00
[3] pry(main)* EOF
=> "Up to $250,000………………………………… $3.90 Over $250,000 to $500,000, add………………$3.70 Over $500,000 to $1,000,000, add……………..$3.40 Over $1,000,000 to $2,000,000, add……...........$2.25\nOver $2,000,000 add …..………………………$2.00\n"
[4] pry(main)> str.scan /\$\d+(?:[,.]\d+)*/
=> ["$250,000", "$3.90", "$250,000", "$500,000", "$3.70", "$500,000", "$1,000,000", "$3.40", "$1,000,000", "$2,000,000", "$2.25", "$2,000,000", "$2.00"]

Why is a space required in Regex when part of match is optional?

I have been looking for noun phrases (noun, plus optional determiner, plus multiple optional adjectives). I wrote this long and terrible bit:
import argparse, re, nltk
def get_words(tagged_sentences):
    words = re.findall(r'\w*\.*\,*/', tagged_sentences)
    clean_word = []
    for word in words:
        word = word[:-1]
        clean_word.append(word)
    # return clean_word
    return ' '.join(clean_word)
noun_phrase = re.findall(r'(\w*/DT\s\w*/JJ\s\w*/NN)|(\w*/DT\s\w*/JJ\s\w*/NN)|(\w*/DT\s\w*/JJ\s\w*/NNP)|(\w*/DT\s\w*/JJ\s\w*/NNPS)|(\w*/JJ\s\w*/NNS)|(\w*/JJ\s\w*/NN)|(\w*/JJ\s\w*/NNP)|(\w*/JJ\s\w*/NNPS)|(\w*/DT\s\w*/NNS)|(\w*/DT\s\w*/NN)|(\w*/DT\s\w*/NNP)|(\w*/DT\s\w*/NNPS)|(\w*/NNS)|(\w*/NN)|(\w*/NNP)|(\w*/NNPS)', tagged_sentences)
phrases = []
for word in noun_phrase:
    phrase = get_words(str(word))
    phrases.append(phrase)
return phrases
At first, I was trying to use .* after the NN or the JJ, but that didn't work. What was I doing wrong? I did something like (\w*/DT\s\w*/JJ.* \s\w*/NN.*) to account for all the different ways words could be tagged as (Adjectives can be JJ,JJR,JJS while Nouns can be NN,NNS,NNP,NNPS)
pos_sent = 'All/DT good/JJ animals/NNS are/VBP equal/JJ ,/, but/CC some/DT animals/NNS are/VBP more/RBR equal/JJ than/IN others/NNS ./.'
Then I saw this:
noun_phrase = re.findall(r'(\S+\/DT )?(\S+\/JJ )*(\S+\/NN )*(\S+\/NN)', tagged_sentences)
I liked it because it is better in every way than what I first did. BUT I don't understand why the spaces are required after 'DT', 'JJ', and the first 'NN' (but cannot be there after the second 'NN'). I am not even sure why the two NN 'finds' cannot be combined into one.
I would also prefer \w over \S, because the tokens should be real letters, not just non-whitespace. Anyway, help understanding WHY would be very much appreciated.
Ok, here is your example text:
All/DT good/JJ animals/NNS are/VBP equal/JJ ,/, but/CC some/DT
animals/NNS are/VBP more/RBR equal/JJ than/IN others/NNS ./.
And this is your regular expression you want to understand:
r'(\S+\/DT )?(\S+\/JJ )*(\S+\/NN )*(\S+\/NN)'
Let's take it one group at a time:
(\S+\/DT ) matches 'All/DT ' and 'some/DT '
(\S+\/JJ ) matches 'good/JJ ' and 'equal/JJ '
(\S+\/NN ) matches nothing
(\S+\/NN) matches 'animals/NN' and 'others/NN'
You're using re.findall(), but that doesn't mean find all of these groups; it means consider the entire regex and find all occurrences of the entire pattern. In addition to the groups, it's key to note that, because of the question mark, your first pattern (\S+\/DT ) is optional. Because of the asterisks, your second (\S+\/JJ ) and third (\S+\/NN ) patterns will match zero or more times. Thus, they are also effectively optional and the only thing required is your last pattern (\S+\/NN).
A quick test looks like this
import re
s = 'All/DT good/JJ animals/NNS are/VBP equal/JJ ,/, but/CC some/DT animals/NNS are/VBP more/RBR equal/JJ than/IN others/NNS ./.'
pat = r'(\S+\/DT )?(\S+\/JJ )*(\S+\/NN )*(\S+\/NN)'
res = re.findall(pat, s)
for i, g in enumerate(res):
    print('{}: {}'.format(i, g))
which gives this output:
0: ('All/DT ', 'good/JJ ', '', 'animals/NN')
1: ('some/DT ', '', '', 'animals/NN')
2: ('', '', '', 'others/NN')
If we remove the spaces,
pat2 = r'(\S+\/DT)?(\S+\/JJ)*(\S+\/NN)*(\S+\/NN)'
res2 = re.findall(pat2, s)
for i, g in enumerate(res2):
    print('{}: {}'.format(i, g))
the output will be exactly what you'd expect,
0: ('', '', '', 'animals/NN')
1: ('', '', '', 'animals/NN')
2: ('', '', '', 'others/NN')
i.e., only the required ones match. Your question is why? I think the issue is that you may feel you are issuing a series of patterns to look for, but you are looking for a single pattern that has multiple match groups. In other words, your regular expression requires these patterns to exist in the order you specify. If they aren't there, then sure, it doesn't matter that they are ordered, but if they are there, they have to be ordered exactly as the regex specifies.
So with the spaces, (\S+\/DT )?(\S+\/JJ ) matches 'All/DT good/JJ ' because it literally says: match 1 or more non-whitespace characters, a forward slash, DT and a space, followed by 1 or more non-whitespace characters, a forward slash, JJ and a space. Without the spaces, (\S+\/DT)?(\S+\/JJ) would require either that the entire (\S+\/DT) pattern NOT be there, or that, if it IS there, no space follows 'DT' before the next token.
I think the key is that you're matching the entire sequence. Without the space, it simply doesn't match the text anymore. If you want these patterns to be considered independently, you will need to use the pipe symbol (|) to indicate OR between your pattern groups.
What you can do is write a linear regex using optional groups, and simplify your code so that it only processes valid matches and leverages a list comprehension:
import re
def get_words(tagged_sentences):
    clean_word = re.findall(r'(\w+)/', tagged_sentences)
    return ' '.join(clean_word)
tagged_sentences = 'All/DT good/JJ animals/NNS are/VBP equal/JJ ,/, but/CC some/DT animals/NNS are/VBP more/RBR equal/JJ than/IN others/NNS ./.'
pat = r"""\w*(?:/(?:JJ\s\w*|DT\s\w*(?:/JJ\s\w*)?))?/NN(?:[SP]?|PS)"""
noun_phrase = re.findall(pat, tagged_sentences)
phrases = [get_words(str(word)) for word in noun_phrase]
print(phrases)
# => ['All good animals', 'some animals', 'others']
See the Python demo.
The pattern extraction regex now matches:
\w* - 0+ word (letter/digit/_ chars) chars (replace * with + to match one or more)
(?:/(?:JJ\s\w*|DT\s\w*(?:/JJ\s\w*)?))? - an optional sequence (due to ? quantifier) of:
/ - a slash
(?:JJ\s\w*|DT\s\w*(?:/JJ\s\w*)?) - either of the sequences of:
JJ\s\w* - JJ followed with 1 whitespace (add + after it to match one or more) and then 0+ word chars
| - or
DT\s\w*(?:/JJ\s\w*)? - a DT, then 1 whitespace, then 0+ word chars, and then an optional sequence of /JJ, followed with 1 whitespace and 0+ word chars
/NN - a literal substring /NN
(?:[SP]?|PS) - either S, or P, or PS or an empty string (since the [SP]? is optional).
The regex that gets the word from the tagged token is
re.findall(r'(\w+)/', tagged_sentences)
Here, (\w+)/ matches and captures 1+ word chars, and they are extracted with re.findall while the / is omitted from the result as it is not part of the capturing group.

Regex - Extracting a number when preceded OR followed by a currency sign

if (preg_match_all('((([£€$¥](([ 0-9]([0-9])*)((\.|\,)(\d{2}|\d{1}))|([ 0-9]([0-9])*)))|(([0-9]([0-9])*)((\.|\,)(\d{2}|\d{1})(\s{0}|\s{1}))|([0-9]([0-9])*(\s{0}|\s{1})))[£€$¥]))', $Commande, $matches)) {
$tot1 = $matches[0];
}
This is my tested solution.
It works for all 4 currencies when sign is placed before or after, with or without a space in between.
It works with a dot or a comma for decimals.
It works without decimal, or with just 1 number after the dot or comma.
It extracts several amounts from the same string, in any mix of the formats described above, as long as there is a space in between.
I think it covers everything, although I am sure it can be simplified.
It was needed for an international order form where clients enter the amounts themselves, along with the description, in the same field.
You can use a conditional:
if (preg_match_all('~(\$ ?)?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?(?:[pcm]|bn|[mb]illion)?(?(1)| ?\$)~i', $order, $matches)) {
$tot = $matches[0];
}
Explanation:
I put the currency in the first capturing group: (\$ ?) and I make it optional with a ?
At the end of the pattern, I use an if then else:
(?(1) # if the first capturing group matched
# then match nothing
| # else
[ ]?\$ # matches the currency
) # end of the conditional
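The same conditional construct is also available in Ruby's regex engine, so the idea can be tried outside PHP. The sketch below is a simplified version of the pattern above: it leaves out the unit suffixes and the i flag, and the sample string `order` is made up for illustration.

```ruby
# Group 1 is the optional leading "$"; the conditional at the end
# requires a trailing "$" only when group 1 did not match.
AMOUNT = /(\$ ?)?[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?(?(1)| ?\$)/

order = "paid $1,250.00, then 300 $ and 42 items"

found = []
# A block is used because scan would otherwise return only the capture group.
order.scan(AMOUNT) { found << Regexp.last_match(0) }
p found
```

The bare "42" is skipped because it has no currency sign on either side.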
You should check for optional $ at the end of amount:
\$? ?(\d[\d ,]*(?:\.\d{1,2})?|\d[\d,](?:\.\d{2})?) ?\$?(?:[pcm]|bn|[mb]illion)
Live demo

Regular expression to find unescaped double quotes in CSV file

What would a regular expression be to find sets of 2 unescaped double quotes that are contained in columns set off by double quotes in a CSV file?
Not a match:
"asdf","asdf"
"", "asdf"
"asdf", ""
"adsf", "", "asdf"
Match:
"asdf""asdf", "asdf"
"asdf", """asdf"""
"asdf", """"
Try this:
(?m)""(?![ \t]*(,|$))
Explanation:
(?m) // enable multi-line matching (^ will act as the start of the line and $ will act as the end of the line (i))
"" // match two successive double quotes
(?! // start negative look ahead
[ \t]* // zero or more spaces or tabs
( // open group 1
, // match a comma
| // OR
$ // the end of the line or string
) // close group 1
) // stop negative look ahead
So, in plain English: "match two successive double quotes, only if they DON'T have a comma or end-of-the-line ahead of them with optionally spaces and tabs in between".
(i) besides being the normal start-of-the-string and end-of-the-string meta characters.
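A minimal Ruby sketch that runs this pattern over the rows from the question. In Ruby, $ already matches at the end of each line, and (?m) has a different meaning there (it lets . match newlines), so the flag is dropped. The names `UNESCAPED` and `rows` are illustrative.

```ruby
# Two successive quotes that are NOT followed by optional blanks and then
# a comma or the end of the line.
UNESCAPED = /""(?![ \t]*(?:,|$))/

rows = [
  '"asdf","asdf"',
  '"", "asdf"',
  '"asdf", ""',
  '"adsf", "", "asdf"',
  '"asdf""asdf", "asdf"',
  '"asdf", """asdf"""',
  '"asdf", """"'
]

rows.each { |row| puts "#{row}: #{row.match?(UNESCAPED)}" }
```

The first four rows should report false and the last three true, in line with the "Not a match" and "Match" lists in the question.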
Due to the complexity of your problem, the solution depends on the engine you are using. This is because solving it requires lookbehind and lookahead, and engines differ in how they support them.
My answer uses the Ruby engine. The check itself is a single regex, but I put the whole code here to explain it better.
NOTE that, due to the Ruby regex engine (or my knowledge of it), optional lookahead/lookbehind is not possible. So I need a small preprocessing step to handle spaces before and after commas.
Here is my code:
orgTexts = [
'"asdf","asdf"',
'"", "asdf"',
'"asdf", ""',
'"adsf", "", "asdf"',
'"asdf""asdf", "asdf"',
'"asdf", """asdf"""',
'"asdf", """"'
]
orgTexts.each { |orgText|
  # Preprocessing: remove spaces before and after a separating comma.
  # Needed only if spaces may surround a valid comma.
  orgText = orgText.gsub(Regexp.new('\" *, *\"'), '","')
  # Replace every valid character (non-quote or valid quote) with '-'
  resText = orgText.gsub(Regexp.new('([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\\\)\")'), '-')
  # resText = orgText.gsub(Regexp.new('([^\"]|(^|(?<=,)|(?<=\\\\))\"|\"($|(?=,)))'), '-')
  # [^\"] ===> a non-quote
  # | ===> or
  # ^\" ===> beginning quote
  # | ===> or
  # \"$ ===> ending quote
  # | ===> or
  # (?<=,)\" ===> quote just after a comma
  # \"(?=,) ===> quote just before a comma
  # (?<=\\\\)\" ===> escaped quote (a single backslash once the string is parsed)
  # This part shows the invalid unescaped quotes (puts adds the newline)
  puts orgText
  puts resText.gsub(Regexp.new('"'), '^')
  # This part determines whether there are unescaped quotes.
  # This is the actual matching; use it alone if you don't need to know which quote is unescaped.
  # In a regex literal a single backslash is written \\ (not \\\\ as in the string above).
  isMatch = ((orgText =~ /^([^\"]|^\"|\"$|(?<=,)\"|\"(?=,)|(?<=\\)\")*$/) != 0).to_s
  # Basically, from start to end (^...$) there must be only valid characters
  puts orgText + ": " + isMatch
}
When executed the code prints:
"asdf","asdf"
-------------
"asdf","asdf": false
"","asdf"
---------
"","asdf": false
"asdf",""
---------
"asdf","": false
"adsf","","asdf"
----------------
"adsf","","asdf": false
"asdf""asdf","asdf"
-----^^------------
"asdf""asdf","asdf": true
"asdf","""asdf"""
--------^^----^^-
"asdf","""asdf""": true
"asdf",""""
--------^^-
"asdf","""": true
I hope this gives you some ideas that you can use with other engines and languages.
".*"(\n|(".*",)*)
should work, I guess...
For single-line matches:
^("[^"]*"\s*,\s*)*"[^"]*""[^"]*"
or for multi-line:
(^|\r\n)("[^\r\n"]*"\s*,\s*)*"[^\r\n"]*""[^\r\n"]*"
Edit/Note: Depending on the regex engine used, you could use lookbehinds and other stuff to make the regex leaner. But this should work in most regex engines just fine.
Try this regular expression:
"(?:[^",\\]*|\\.)*(?:""(?:[^",\\]*|\\.)*)+"
That will match any quoted string with at least one pair of unescaped double quotes.