Python: My regular expression doesn't match on some string cases - regex

I am trying to create a regular expression, and so far I have this pattern:
(-?['0|1']{1}.[00000000e+00| ]?){1}\s(-?['0|1']{1}.[00000000e+00| ]?){1}
My goal is to detect the pattern ({string pattern}{blank}{string pattern}).
These are the possible string patterns:
'0'
'-0.'
'1.'
'-1.'
'1.00000000e+00'
'0.00000000e+00'
'-0.00000000e+00'
'-1.00000000e+00'
'0. ' (the trailing blanks can be 1 to 8 characters long)
'-0. ' (the trailing blanks can be 1 to 8 characters long)
'1. ' (the trailing blanks can be 1 to 8 characters long)
'-1. ' (the trailing blanks can be 1 to 8 characters long)
My pattern passes most test cases, but it fails on some of them
(e.g. those containing '00000000e+00' or trailing blanks).
The trailing blanks are especially difficult for me, because there can be anywhere from 1 to 8 blank (' ') characters.
These are my test cases:
['0. 0.']
['0. 1.']
['1. 0.']
['1. 1.']
['-0. -0.']
['-0. 0.']
['0. -0.']
['1. -0.']
['1. -1.']
['-1. 1.']
['-1. -1.']
['-1.00000000e+00 0.'] # Fail
['0. -1. '] # Fail
['0. 0. '] # Fail
['-0.00000000e+00 1.00000000e+00'] # Fail
['-0. 1.00000000e+00'] # Fail
Please give me some advice.

You could use
(-?[01]\.(?:00000000e\+00| {1,8})?)\s(-?[01]\.(?:00000000e\+00| {1,8})?)
The pattern matches:
( Capture group 1
-?[01]\. Match an optional -, then either 0 or 1, then a literal . (note the escaped dot)
(?: Non capture group for the alternation |
00000000e\+00| {1,8} Match either 00000000e+00 or 1-8 spaces
)? Close non capture group and make it optional
) Close group 1
\s Match a single whitespace char
(-?[01]\.(?:00000000e\+00| {1,8})?) Capture group 2, the same pattern as capture group 1
Note that \s could also match a newline, and if you want the match only you can omit the capture groups.
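Python's built-in re module does not support this, but in engines that support subroutine calls (for example PCRE, or the third-party regex module for Python) you can shorten the pattern by recursing the first subpattern, since the pattern uses the same part twice.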
(-?[01]\.(?:0{8}e\+00| {1,8})?)\s(?1)
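As a rough check in Python, the first pattern can be run against the question's test cases. This is only a sketch: the names NUMBER and PAIR are made up for illustration, and re.fullmatch is used on the assumption that each string should match in its entirety. Because the built-in re module has no (?1) syntax, the shared sub-pattern is reused through string formatting instead.

```python
import re

# One number: optional minus, 0 or 1, a literal dot, then optionally
# either the exponent suffix or 1-8 trailing spaces.
NUMBER = r"-?[01]\.(?:0{8}e\+00| {1,8})?"

# Two such numbers separated by a single whitespace character.
PAIR = re.compile(rf"({NUMBER})\s({NUMBER})")

cases = [
    "0. 0.", "1. -0.", "-1. -1.",
    "-1.00000000e+00 0.",
    "0. -1. ",
    "0. 0. ",
    "-0.00000000e+00 1.00000000e+00",
    "-0. 1.00000000e+00",
]

results = {s: PAIR.fullmatch(s) is not None for s in cases}
for s, ok in results.items():
    print(repr(s), "match" if ok else "no match")
```

Note that this pattern requires the dot, so a bare '0' without a dot would not match.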

Would you please try the following:
import re
l = ['0. 0.',
'0. 1.',
'1. 0.',
'1. 1.',
'-0. -0.',
'-0. 0.',
'0. -0.',
'1. -0.',
'1. -1.',
'-1. 1.',
'-1. -1.',
'-1.00000000e+00 0.',
'0. -1. ',
'0. 0. ',
'-0.00000000e+00 1.00000000e+00',
'-0. 1.00000000e+00']
for s in l:
    if re.match(r'-?[01]\.?(?:0{8}e\+00|\s{1,8})?\s-?[01]\.?(?:0{8}e\+00|\s{1,8})?$', s):
        print("match")
    else:
        print("no match")
Explanation of regex -?[01]\.?(?:0{8}e\+00|\s{1,8})?:
-? matches an optional dash character (0 or 1 occurrences)
[01]\.? matches 0 or 1 followed by an optional dot character (written [01] rather than [0|1], because a | inside a character class matches a literal |)
0{8}e\+00 matches a substring 00000000e+00
\s{1,8} matches whitespaces of length between 1 and 8
(?:0{8}e\+00|\s{1,8})? matches either or none of two regexes above

Apparently you have two false impressions.
You seem to think of [ ] as a group construct while it denotes a character class.
You seem to think you'd have to include the string delimiting quotes in the pattern.
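A small Python sketch of the first point, using the bracketed part of the question's pattern. The variable names are illustrative only. It shows that every character inside [ ] is a candidate on its own, including the quotes and the pipe, while alternation between alternatives needs a group.

```python
import re

# Square brackets form a character class: every character inside is a
# candidate, including the quote marks and the pipe.
char_class = re.compile(r"['0|1']")
class_hits = [c for c in "01|'2" if char_class.fullmatch(c)]
print(class_hits)

# Alternation between whole alternatives needs a group instead.
group = re.compile(r"(?:0|1)")
group_hits = [c for c in "01|'2" if group.fullmatch(c)]
print(group_hits)
```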
Since one could interpret your question to the effect that you want to test for two numbers of -1, 0 or 1, and others already gave regex answers, here's a regex-free alternative for that problem:
test = ['0. 0.', '0. 1.', '1. 0.', '1. 1.', '-0. -0.', '-0. 0.', '0. -0.', '1. -0.',
'1. -1.', '-1. 1.', '-1. -1.', '-1.00000000e+00 0.', '0. -1. ', '0. 0. ',
'-0.00000000e+00 1.00000000e+00', '-0. 1.00000000e+00', 'x y', '-1 0 1']
for t in test:
    print([t], end='\t')
    s = t.split()
    try:
        if len(s) != 2: raise ValueError
        for f in s:
            g = float(f)
            if g!=-1 and g!=0 and g!=1: raise ValueError
    except ValueError:
        print('Fail')
    else:
        print('Pass')

Related

how to get numbers from array of strings?

I have this array of strings.
["Anyvalue", "Total", "value:", "9,999.00", "Token", " ", "|", " ", "Total", "chain", "value:", "4,948"]
and I'm trying to get the numbers in one line of code. I tried many methods, but none of them worked as I expected.
I'm using one with grep method:
array.grep(/\d+/, &:to_i) #[9, 4]
but it returns an array of only the leading integers. It seems like I have to add something to the pattern, but I don't know what.
Or is there another way to grab these numbers into an Array?
you can use:
array.grep(/[\d,]+\.?\d+/)
if you want int:
array.grep(/[\d,]+\.?\d+/).map {_1.gsub(/[^0-9\.]/, '').to_i}
and a faster way (about 5X to 10X):
array.grep(/[\d,]+\.?\d+/).map { _1.delete("^0-9.").to_i }
For data like:
data = %w[
,,,4
1
1.2.3.4
-2
1,2,3
9,999.00
4,948
22,956
22,536,129,336
123,456
12.]
use:
data.grep(/^-?\d{1,3}(,\d{3})*(\.?\d+)?$/)
output:
["1", "-2", "9,999.00", "4,948", "22,956", "22,536,129,336", "123,456"]
arr = ["Anyvalue", "Total", "value:", "9,999.00", "Token", " ", "61.4.5",
"|", "chain", "-4,948", "3,25.61", "1,234,567.899"]
rgx = /\A\-?\d{1,3}(?:,\d{3})*(?:\.\d+)?\z/
arr.grep(rgx)
#=> ["9,999.00", "-4,948", "1,234,567.899"]
Regex demo. At the link the regular expression was evaluated with the PCRE regex engine, but the results are the same when Ruby's Onigmo engine is used. Also, at the link I've used the anchors ^ and $ (beginning and end of line) instead of \A and \z (beginning and end of string) in order to test the regex against multiple strings.
The regular expression can be broken down as follows.
/
\A # match the beginning of the string
\-? # optionally match '-'
\d{1,3} # match between 1 and 3 digits inclusively
(?: # begin a non-capture group
,\d{3} # match a comma followed by 3 digits
)* # end the non-capture group and execute 0 or more times
(?: # begin a non-capture group
\.\d+ # match a period followed by one or more digits
)? # end the non-capture and make it optional
\z # match the end of the string
/
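For readers who want to try the same pattern outside Ruby, here is a minimal Python sketch. It assumes re.fullmatch as the stand-in for the \A and \z anchors; the name NUMBER is made up for illustration, and the array is the one from the Ruby example above.

```python
import re

# Python spelling of the pattern; fullmatch plays the role of \A ... \z.
NUMBER = re.compile(r"-?\d{1,3}(?:,\d{3})*(?:\.\d+)?")

arr = ["Anyvalue", "Total", "value:", "9,999.00", "Token", " ", "61.4.5",
       "|", "chain", "-4,948", "3,25.61", "1,234,567.899"]

# Keep only the strings that are well-formed numbers in their entirety.
numbers = [s for s in arr if NUMBER.fullmatch(s)]
print(numbers)
```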
To make the test more robust we could use the methods Kernel::Float, Kernel::Rational and Kernel::Complex, all with the optional argument :exception set to false.
arr = ["Total", "9,999.00", " ", "61.4.5", "23e4", "-234.7e-2", "1+2i",
"3/4", "|", "chain", "-4,948", "3,25.61", "1,234,567.899", "10"]
arr.select { |s| s.match?(rgx) || Float(s, exception: false) ||
Rational(s, exception: false) || Complex(s, exception: false) }
#=> ["9,999.00", "23e4", "-234.7e-2", "1+2i", "3/4", "-4,948",
# "1,234,567.899", "10"]
Note that "23e4", "-234.7e-2", "1+2i" and "3/4" are respectively the string representations of an integer, float, complex and rational number.

Single RegEx to catch multiple options and replace with their corresponding replacements

The problem goes like this:
value match: 218\d{3}(\d{4})#domain.com replace with 10\1 to get 10 followed by last 4 digits
for example 2181234567 would become 104567
value match: 332\d{3}(\d{4})#domain.com replace with 11\1 to get 11 followed by last 4 digits
for example 3321234567 would become 114567
value match: 420\d{3}(\d{4})#domain.com replace with 12\1 to get 12 followed by last 4 digits
..and so on
for example 4201234567 would become 124567
Is there a better way to catch different values and replace with their corresponding replacements in a single RegEx than creating multiple expressions?
Like (218|332|420)\d{3}(\d{4})#domain.com to replace with (10\4|11\4|12\4) and get just their corresponding results when matched.
Edit: Didn't specify the use case: It's for my PBX, that just uses RegEx to match patterns and then replace it with the values I want it to go out with. No code. Just straight up RegEx in the GUI.
Also for personal use, if I can get it to work with Notepad++
Ctrl+H
Find what: (?:(218)|(332)|(420))\d{3}(\d{4})(?=#domain\.com)
Replace with: (?{1}10$4)(?{2}11$4)(?{3}12$4)
CHECK Wrap around
CHECK Regular expression
Replace all
Explanation:
(?: # non capture group
(218) # group 1, 218
| # OR
(332) # group 2, 332
| # OR
(420) # group 3, 420
) # end group
\d{3} # 3 digits
(\d{4}) # group 4, 4 digits
(?=#domain\.com) # positive lookahead, make sure we have "#domain.com" after
# that allows to keep "#domain.com"
# if you want to remove it from the result, just put "#domain\.com"
# without lookahead.
Replacement:
(?{1} # if group 1 exists
10 # insert "10"
$4 # insert content of group 4
) # endif
(?{2}11$4) # same as above
(?{3}12$4) # same as above
I don't think you can use a single regular expression to conditionally replace text as per your example. You either need to chain multiple search & replace, or use a function that does a lookup based on the first captured group (first three digits).
You did not specify the language used, regular expressions vary based on language. Here is a JavaScript code snippet that uses the function with lookup approach:
var str1 = '2181234567#domain.com';
var str2 = '3321234567#domain.com';
var str3 = '4201234567#domain.com';
var strMap = {
  '218': '10',
  '332': '11',
  '420': '12'
  // add more as needed
};
function fixName(str) {
  var re = /(\d{3})\d{3}(\d{4})(?=\#domain\.com)/;
  var result = str.replace(re, function(m, p1, p2) {
    return strMap[p1] + p2;
  });
  return result;
}
var result1 = fixName(str1);
var result2 = fixName(str2);
var result3 = fixName(str3);
console.log('str1: ' + str1 + ', result1: ' + result1);
console.log('str2: ' + str2 + ', result2: ' + result2);
console.log('str3: ' + str3 + ', result3: ' + result3);
Output:
str1: 2181234567#domain.com, result1: 104567#domain.com
str2: 3321234567#domain.com, result2: 114567#domain.com
str3: 4201234567#domain.com, result3: 124567#domain.com
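The same lookup idea can be sketched in Python, where re.sub accepts a function as the replacement. The names PREFIX_MAP and fix_name are illustrative, and leaving unknown prefixes untouched is an assumption added here rather than something the JavaScript version does.

```python
import re

# Hypothetical lookup table, mirroring strMap in the JavaScript above.
PREFIX_MAP = {"218": "10", "332": "11", "420": "12"}

# First three digits, three skipped digits, last four digits,
# with "#domain.com" required after them but kept out of the match.
PATTERN = re.compile(r"(\d{3})\d{3}(\d{4})(?=#domain\.com)")

def fix_name(text):
    def repl(m):
        prefix = PREFIX_MAP.get(m.group(1))
        # Leave the text untouched when the prefix is not in the table.
        return m.group(0) if prefix is None else prefix + m.group(2)
    return PATTERN.sub(repl, text)

for s in ["2181234567#domain.com", "3321234567#domain.com", "4201234567#domain.com"]:
    print(s, "->", fix_name(s))
```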
@Toto has a nice answer, and there is another method if the operator (?{1}...) is not available (but thanks, Toto, I did not know this feature of Notepad++).
More details on my answer here: https://stackoverflow.com/a/63676336/1287856
Append to the end of the doc:
,218=>10,332=>11,420=>12
Search for:
(218|332|420)\d{3}(\d{4})(?=#domain.com)(?=[\s\S]*,\1=>([^,]*))
Replace with
\3\2
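To see how the appended lookup table works, here is a small Python sketch of the same search and replace. It assumes the table is the last line of the document with no trailing newline, and it escapes the dot in #domain\.com; the variable names are illustrative. The second lookahead scans forward to the table entry whose key equals group 1 and captures its value as group 3.

```python
import re

# Sample document with the lookup table appended as the last line.
doc = (
    "2181234567#domain.com\n"
    "3321234567#domain.com\n"
    "4201234567#domain.com\n"
    ",218=>10,332=>11,420=>12"
)

pattern = re.compile(
    r"(218|332|420)\d{3}(\d{4})(?=#domain\.com)(?=[\s\S]*,\1=>([^,]*))"
)

# Group 3 is the looked-up replacement prefix, group 2 the last four digits.
result = pattern.sub(r"\g<3>\g<2>", doc)
print(result)
```

The table line itself stays in the text and would have to be removed afterwards.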

How to extract negative/positive floating point/whole numbers and math signs used in this expression?

I have different strings in the form of:
"-33.3*-50"
"3+5"
"-109.51+-33"
I am looking to have the following output:
"-33.3", "*", "-50"
"3", "+", "5"
"-109.51", "+", "-33"
If you only have a format of x (math symbol) y, then following regex will work:
(-?\d+(?:\.\d+)?)(\*|\+|\-|\/|\%)(-?\d+(?:\.\d+)?)
Regex Demo
( # Group 1
-? # Match if - symbol present
\d+ # Match digits
(?:\.\d+)? # Non-capturing group - match if contains float part
) # Close Group 1
( # Group 2
\*|\+|\-|\/|\% # Match math symbols *,+,-,/,%, separated using | -> or. Add others as needed.
) # Close Group 2
(-?\d+(?:\.\d+)?) # Group 3 structure is same as Group 1 above.
Every individual part of the equation is in an individual group. E.g. -33.3 is captured in Group 1, * is captured in Group 2, -50 is captured in Group 3.
You can then substitute with "$1", "$2", "$3" in order to get the result that you want (see bottom of Regex Demo page).
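In Python, the three groups can be read directly from the match object instead of substituting. The sketch below assumes each input is exactly one "x operator y" expression; split_expression is a made-up helper name, and the operator alternation is written as the equivalent character class [*+\-/%].

```python
import re

# Group 1: left number, group 2: operator, group 3: right number.
EXPR = re.compile(r"(-?\d+(?:\.\d+)?)([*+\-/%])(-?\d+(?:\.\d+)?)")

def split_expression(text):
    # Returns [left, operator, right], or None if the text has another shape.
    m = EXPR.fullmatch(text)
    return list(m.groups()) if m else None

for s in ["-33.3*-50", "3+5", "-109.51+-33"]:
    print(split_expression(s))
```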
This expression might be somewhat close,
[+*/]|[0-9.-]+
without validation.
Test
import re
string = """
-33.3*-50
3+5
-109.51+-33
-109.51+-33/24
"""
print(re.findall(r'[+*/]|[0-9.-]+', string))
Output
['-33.3', '*', '-50', '3', '+', '5', '-109.51', '+', '-33', '-109.51', '+', '-33', '/', '24']
If you wish to explore/simplify/modify the expression, it's been explained on the top right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.

Regex scan fails

I am trying to parse all money from a string. For example, I want to extract:
['$250,000', '$3.90', '$250,000', '$500,000']
from:
'Up to $250,000………………………………… $3.90 Over $250,000 to $500,000'
The regex:
\$\ ?(\d+\,)*\d+(\.\d*)?
seems to match all money expressions as in this link. However, when I try to scan with it in Ruby, it fails to give me the desired result.
s # => "Up to $250,000 $3.90 Over $250,000 to $500,000, add$3.70 Over $500,000 to $1,000,000, add..$3.40 Over $1,000,000 to $2,000,000, add...........$2.25\nOver $2,000,000 add ..$2.00"
r # => /\$\ ?(\d+\,)*\d+\.?\d*/
s.scan(r)
# => [["250,"], [nil], ["250,"], ["500,"], [nil], ["500,"], ["000,"], [nil], ["000,"], ["000,"], [nil], ["000,"], [nil]]
From String#scan docs, it looks like this is because of the group. How can I parse all the money in the string?
Let's look at your regular expression, which I'll write in free-spacing mode so I can document it:
r = /
\$ # match a dollar sign
\ ? # optionally match a space (has no effect)
( # begin capture group 1
\d+ # match one or more digits
, # match a comma (need not be escaped)
)* # end capture group 1 and execute it >= 0 times
\d+ # match one or more digits
\.? # optionally match a period
\d* # match zero or more digits
/x # free-spacing regex definition mode
In non-free-spacing mode this would be written as follows.
r = /\$ ?(\d+,)*\d+\.?\d*/
When a regex is defined in free-spacing mode all spaces are stripped out before the regex is evaluated, which is why I had to escape the space. That's not necessary when the regex is not defined in free-spacing mode.
A space never needs to be matched after the dollar sign, so \ ? should be removed. Suppose now we have
r = /\$\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> ["$2.31", "$44.", "$33.607"]
That works, but it is questionable whether you want to match values that do not have exactly two digits after the decimal point.
Now write
r = /\$(\d+,)*\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> [[nil], [nil], [nil]]
To see why this result was obtained examine the doc for String#scan, specifically the last sentence of the first paragraph: " If the pattern contains groups, each individual result is itself an array containing one entry per group.".
We can avoid that problem by changing the capture group to a non-capture group:
r = /\$(?:\d+,)*\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
#=> ["$2.31", "$44.", "$33.607"]
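Python's re.findall behaves in a comparable way, which may help readers who know that API better: with a capture group in the pattern it returns the group contents rather than the whole match. A minimal sketch (variable names are illustrative; unlike Ruby's nil, Python reports a group that did not take part as an empty string):

```python
import re

text = "$2.31 cat $44. dog $33.607"

# With a capture group, findall reports what the group captured.
with_group = re.findall(r"\$(\d+,)*\d+\.?\d*", text)

# With a non-capture group, findall reports the whole match.
without_group = re.findall(r"\$(?:\d+,)*\d+\.?\d*", text)

print(with_group)
print(without_group)
```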
Now consider this:
"$2,241.31 cat $1,2345. dog $33.607".scan r
#=> ["$2,241.31", "$1,2345.", "$33.607"]
which is still not quite right. Try the following.
r = /
\$ # match a dollar sign
\d{1,3} # match one to three digits
(?:,\d{3}) # match ',' then 3 digits in a nc group
* # execute the above nc group >=0 times
(?:\.\d{2}) # match '.' then 2 digits in a nc group
? # optionally match the above nc group
(?![\d,.]) # no following digit, ',' or '.'
/x # free-spacing regex definition mode
"$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27".scan r
#=> ["$2,241.31", "$2", "$1,234", "$146.27"]
(?![\d,.]) is a negative lookahead.
In normal mode this regular expression is written as follows.
r = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?![\d,.])/
The following erroneous result would be obtained without the negative lookahead at the end of the regex.
r = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?/
"$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27".scan r
#=> ["$2,241.31", "$2", "$1,234", "$3,615", "$33.60",
# "$146.27"]
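The effect of the lookahead can also be reproduced in Python; this is a sketch with made-up names (MONEY, MONEY_NO_LOOKAHEAD), using the same test string as above.

```python
import re

MONEY = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?![\d,.])")
MONEY_NO_LOOKAHEAD = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

text = "$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27"

# The lookahead rejects amounts that run on into more digits, ',' or '.'.
amounts = MONEY.findall(text)
# Without it, malformed amounts are truncated to a valid-looking prefix.
truncated = MONEY_NO_LOOKAHEAD.findall(text)

print(amounts)
print(truncated)
```

One consequence of the lookahead as written is that an amount directly followed by a comma or a period, such as "$500,000, add" in the question's text, is rejected as well, so the character class in the lookahead may need adjusting for that input.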
[3] pry(main)> str = <<EOF
[3] pry(main)* Up to $250,000………………………………… $3.90 Over $250,000 to $500,000, add………………$3.70 Over $500,000 to $1,000,000, add……………..$3.40 Over $1,000,000 to $2,000,000, add……...........$2.25
[3] pry(main)* Over $2,000,000 add …..………………………$2.00
[3] pry(main)* EOF
=> "Up to $250,000………………………………… $3.90 Over $250,000 to $500,000, add………………$3.70 Over $500,000 to $1,000,000, add……………..$3.40 Over $1,000,000 to $2,000,000, add……...........$2.25\nOver $2,000,000 add …..………………………$2.00\n"
[4] pry(main)> str.scan /\$\d+(?:[,.]\d+)*/
=> ["$250,000", "$3.90", "$250,000", "$500,000", "$3.70", "$500,000", "$1,000,000", "$3.40", "$1,000,000", "$2,000,000", "$2.25", "$2,000,000", "$2.00"]
[5] pry(main)>

Why is a space required in Regex when part of match is optional?

I have been looking for noun phrases (noun, plus optional determiner, plus multiple optional adjectives). I wrote this long and terrible bit:
import argparse, re, nltk
def get_words(tagged_sentences):
    words = re.findall(r'\w*\.*\,*/', tagged_sentences)
    clean_word = []
    for word in words:
        word = word[:-1]
        clean_word.append(word)
    # return clean_word
    return ' '.join(clean_word)
noun_phrase = re.findall(r'(\w*/DT\s\w*/JJ\s\w*/NN)|(\w*/DT\s\w*/JJ\s\w*/NN)|(\w*/DT\s\w*/JJ\s\w*/NNP)|(\w*/DT\s\w*/JJ\s\w*/NNPS)|(\w*/JJ\s\w*/NNS)|(\w*/JJ\s\w*/NN)|(\w*/JJ\s\w*/NNP)|(\w*/JJ\s\w*/NNPS)|(\w*/DT\s\w*/NNS)|(\w*/DT\s\w*/NN)|(\w*/DT\s\w*/NNP)|(\w*/DT\s\w*/NNPS)|(\w*/NNS)|(\w*/NN)|(\w*/NNP)|(\w*/NNPS)', tagged_sentences)
phrases = []
for word in noun_phrase:
    phrase = get_words(str(word))
    phrases.append(phrase)
return phrases
At first, I was trying to use .* after the NN or the JJ, but that didn't work. What was I doing wrong? I did something like (\w*/DT\s\w*/JJ.* \s\w*/NN.*) to account for all the different ways words could be tagged as (Adjectives can be JJ,JJR,JJS while Nouns can be NN,NNS,NNP,NNPS)
pos_sent = 'All/DT good/JJ animals/NNS are/VBP equal/JJ ,/, but/CC some/DT animals/NNS are/VBP more/RBR equal/JJ than/IN others/NNS ./.'
Then I saw this:
noun_phrase = re.findall(r'(\S+\/DT )?(\S+\/JJ )*(\S+\/NN )*(\S+\/NN)', tagged_sentences)
I liked it because it is way better in every way to what I first did. BUT I don't understand why the spaces are required after 'DT', 'JJ', and the first 'NN'(but cannot be there after the second 'NN'). I am not even sure why the two NN 'finds' cannot be placed into one.
I also preferred to use \w to \S, because it should be real letters not just not white space. Anyway, help understanding WHY would very much be appreciated.
Ok, here is your example text:
All/DT good/JJ animals/NNS are/VBP equal/JJ ,/, but/CC some/DT
animals/NNS are/VBP more/RBR equal/JJ than/IN others/NNS ./.
And this is your regular expression you want to understand:
r'(\S+\/DT )?(\S+\/JJ )*(\S+\/NN )*(\S+\/NN)'
Let's take it one group at a time:
(\S+\/DT ) matches 'All/DT ' and 'some/DT '
(\S+\/JJ ) matches 'good/JJ ' and 'equal/JJ '
(\S+\/NN ) matches nothing
(\S+\/NN) matches 'animals/NN' and 'others/NN'
You're using re.findall(), but that doesn't mean find all of these groups; it means consider the entire regex and find all occurrences of the entire pattern. In addition to the groups, it's key to note that, because of the question mark, your first pattern (\S+\/DT ) is optional. Because of the asterisks, your second (\S+\/JJ ) and third (\S+\/NN ) patterns will match zero or more times. Thus, they are also effectively optional, and the only thing required is your last pattern (\S+\/NN).
A quick test looks like this
import re
s = 'All/DT good/JJ animals/NNS are/VBP equal/JJ ,/, but/CC some/DT animals/NNS are/VBP more/RBR equal/JJ than/IN others/NNS ./.'
pat = r'(\S+\/DT )?(\S+\/JJ )*(\S+\/NN )*(\S+\/NN)'
res = re.findall(pat, s)
for i, g in enumerate(res):
    print('{}: {}'.format(i, g))
which gives this output:
0: ('All/DT ', 'good/JJ ', '', 'animals/NN')
1: ('some/DT ', '', '', 'animals/NN')
2: ('', '', '', 'others/NN')
If we remove the spaces,
pat2 = r'(\S+\/DT)?(\S+\/JJ)*(\S+\/NN)*(\S+\/NN)'
res2 = re.findall(pat2, s)
for i, g in enumerate(res2):
    print('{}: {}'.format(i, g))
the output will be exactly what you'd expect,
0: ('', '', '', 'animals/NN')
1: ('', '', '', 'animals/NN')
2: ('', '', '', 'others/NN')
i.e., only the required ones match. Your question is why? I think the issue is that you may feel you are issuing a series of patterns to look for, but you are looking for a single pattern that has multiple match groups. In other words, your regular expression requires these patterns to exist in the order you specify. If they aren't there, then sure, it doesn't matter that they are ordered, but if they are there, they have to be ordered exactly as the regex specifies.
So with the spaces, (\S+\/DT )?(\S+\/JJ ) matches 'All/DT good/JJ ' because it literally says: match 1 or more non-whitespace characters plus a forward slash plus DT plus a space, followed by 1 or more non-whitespace characters plus a forward slash plus 'JJ' plus a space. Without the spaces, (\S+\/DT)?(\S+\/JJ) would require either that the (\S+\/DT) part NOT be there, or that, if it IS there, the next token follows it immediately with no space after 'DT'.
I think the key is that you're matching the entire sequence. Without the space, it simply doesn't match the text anymore. If you want these patterns to be considered independently, you will need to use the pipe symbol (|) to indicate OR between your pattern groups.
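One way to see that the regex matches a whole sequence rather than separate groups is to print the full match instead of the per-group tuples. This sketch uses re.finditer and m.group(0) on the example sentence; the variable names are illustrative.

```python
import re

s = ('All/DT good/JJ animals/NNS are/VBP equal/JJ ,/, but/CC some/DT '
     'animals/NNS are/VBP more/RBR equal/JJ than/IN others/NNS ./.')

pat = r'(\S+\/DT )?(\S+\/JJ )*(\S+\/NN )*(\S+\/NN)'

# group(0) is the whole match, regardless of which optional groups took part.
phrases = [m.group(0) for m in re.finditer(pat, s)]
print(phrases)
```

Each result is one contiguous stretch of text, and the spaces inside the optional groups are what let that stretch cross from one token to the next.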
What you may do is write a linear regex using optional groups and simplify your code so that it only processes valid matches and leverages a list comprehension:
import re
def get_words(tagged_sentences):
    clean_word = re.findall(r'(\w+)/', tagged_sentences)
    return ' '.join(clean_word)
tagged_sentences = 'All/DT good/JJ animals/NNS are/VBP equal/JJ ,/, but/CC some/DT animals/NNS are/VBP more/RBR equal/JJ than/IN others/NNS ./.'
pat = r"""\w*(?:/(?:JJ\s\w*|DT\s\w*(?:/JJ\s\w*)?))?/NN(?:[SP]?|PS)"""
noun_phrase = re.findall(pat, tagged_sentences)
phrases = [get_words(str(word)) for word in noun_phrase]
print(phrases)
# => ['All good animals', 'some animals', 'others']
The pattern extraction regex now matches:
\w* - 0+ word (letter/digit/_ chars) chars (replace * with + to match one or more)
(?:/(?:JJ\s\w*|DT\s\w*(?:/JJ\s\w*)?))? - an optional sequence (due to ? quantifier) of:
/ - a slash
(?:JJ\s\w*|DT\s\w*(?:/JJ\s\w*)?) - either of the sequences of:
JJ\s\w* - JJ followed with 1 whitespace (add + after it to match one or more) and then 0+ word chars
| - or
DT\s\w*(?:/JJ\s\w*)? - a DT, then 1 whitespace, then 0+ word chars, and then an optional sequence of /JJ, followed with 1 whitespace and 0+ word chars
/NN - a literal substring /NN
(?:[SP]?|PS) - either S, or P, or PS or an empty string (since the [SP]? is optional).
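One detail of the last item may be worth checking if NNPS tags matter: alternation is tried left to right, so [SP]? can succeed on the P before the PS branch is attempted. The sketch below compares the suffix as written with a longest-first variant; the sample tokens and variable names are made up for illustration.

```python
import re

tags = ["dog/NN", "dogs/NNS", "Rome/NNP", "Alps/NNPS"]

# As written: [SP]? is tried first, so NNPS is matched only up to NNP.
as_written = [re.search(r"/NN(?:[SP]?|PS)", t).group(0) for t in tags]

# Longest alternative first: PS is tried before the single letters.
longest_first = [re.search(r"/NN(?:PS|[SP])?", t).group(0) for t in tags]

print(as_written)
print(longest_first)
```

For the example sentence this makes no difference, since it contains no NNPS tag, and get_words only keeps the words before each slash anyway.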
The regex that gets the word from the tagged token is
re.findall(r'(\w+)/', tagged_sentences)
Here, (\w+)/ matches and captures 1+ word chars, and they are extracted with re.findall while the / is omitted from the result as it is not part of the capturing group.