Regex optional group selection doesn't work

I want to extract the numbers from the following text:
Something_Time 10 min (Time in Class T>60°C Something Something )
Something_Time 899 min (Time in Class 35°C<T<=40°C Something Something )
Something_Time 0 min (Time in Class T<=-25°C Something Something )
So what I need is:
|---------------|---------------|---------------|
| Group 1 | Group 2 | Group 3 |
|---------------|---------------|---------------|
| 10 | 60 | |
|---------------|---------------|---------------|
| 899 | 35 | 40 |
|---------------|---------------|---------------|
| 0 | | -25 |
|---------------|---------------|---------------|
Group 2 as lower bound and group 3 as upper bound.
I tried the following regex expression:
^.* (\d{1,6}) min .*(?:[ \>](\-?\d{1,2}))?.*(?:[\=](\-?\d{1,2}))?.*$
This unfortunately does not match groups 2 and 3. It works for the second line as soon as the ? is removed from the end of both groups. Do you have any suggestions?

Try:
^Something_Time (\d{1,6}) min(?:.*?[ >](-?\d{1,2}))?(?:.*?[ =](-?\d{1,2}))?.*$
See Regex Demo
^ Matches start of string.
Something_Time Matches 'Something_Time '
(\d{1,6}) Group 1: 1 - 6 digits
min Matches ' min'
(?:.*?[ >](-?\d{1,2}))? Optional group that lazily matches 0 or more non-newline characters followed by either a space or '>' followed by a number (optional '-' followed by 1 or 2 digits). The number is placed in Group 2.
(?:.*?[ =](-?\d{1,2}))? Optional group that lazily matches 0 or more non-newline characters followed by either a space or '=' followed by a number (optional '-' followed by 1 or 2 digits). The number is placed in Group 3.
.* Matches 0 or more non-newline characters.
$ Matches the end of the string or a newline that precedes the end of the string.
In Python:
import re
tests = [
    'Something_Time 10 min (Time in Class T>60°C Something Something )',
    'Something_Time 899 min (Time in Class 35°C<T<=40°C Something Something )',
    'Something_Time 0 min (Time in Class T<=-25°C Something Something )'
]
for test in tests:
    m = re.match(r'^Something_Time (\d{1,6}) min(?:.*?[ >](-?\d{1,2}))?(?:.*?[ =](-?\d{1,2}))?.*$', test)
    if m:
        print(m.groups())
Prints:
('10', '60', None)
('899', '35', '40')
('0', None, '-25')
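To see why the pattern from the question leaves groups 2 and 3 empty, here is a small sketch that runs both patterns on the second sample line. The greedy .* after ' min ' swallows the rest of the line, and because both groups are optional the engine never needs to backtrack to fill them. The variable names are only for illustration.

```python
import re

line = 'Something_Time 899 min (Time in Class 35°C<T<=40°C Something Something )'

# Pattern from the question: a greedy .* sits in front of each optional group
original = re.compile(r'^.* (\d{1,6}) min .*(?:[ \>](\-?\d{1,2}))?.*(?:[\=](\-?\d{1,2}))?.*$')
# Pattern from the answer: a lazy .*? is moved inside each optional group
fixed = re.compile(r'^Something_Time (\d{1,6}) min(?:.*?[ >](-?\d{1,2}))?(?:.*?[ =](-?\d{1,2}))?.*$')

original_groups = original.match(line).groups()
fixed_groups = fixed.match(line).groups()

print(original_groups)  # ('899', None, None)
print(fixed_groups)     # ('899', '35', '40')
```

Putting the lazy .*? inside the optional group means the group is tried before the rest of the line is consumed, which is what lets it capture a number when one is present.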

Related

Capture 1-9 after the last occurrence of 0

I want to capture all numbers between 1 and 9 after the last occurrence of zero, except for a zero in the last digit. I tried this pattern, but it doesn't seem to work.
Pattern: [1-9].*
DATA
0100179835
3000766774
1500396843
1500028408
1508408637
3105230262
3005228061
3105228407
3105228940
0900000000
2100000000
0800000000
1000000001
2200000001
0800000001
1300000001
1000000002
2200000002
0800000002
1300000002
1000000003
2200000003
0800000003
1300000003
1000000004
2200000004
0800000004
1300000004
1000000005
2200000005
0800000005
1300000005
1000000006
2300000006
0800000006
0900000006
1000000007
2300000007
0900000007
0800000007
1000000008
2300000008
0900000008
0800000008
1100000009
2300000009
0900000009
0800000009
1000005217
2000000429
1100000020
1000005000
3000000070
2000000400
1000020000
3000200000
2906000000
Desired Result
179835
766774
396843
28408
8408637
5230262
5228061
5228407
5228940
0
0
0
1
1
1
1
2
2
2
2
3
3
3
3
4
4
4
4
5
5
5
5
6
6
6
6
7
7
7
7
8
8
8
8
9
9
9
9
5217
429
20
5000
70
400
20000
200000
6000000

You can anchor the end of the string and match non-zero digits with an optional trailing zero. Ensure that there is at least one matching digit with a positive lookahead pattern:
(?=\d)[1-9]*0?$
Demo: https://regex101.com/r/uggV37/2
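A quick way to try this pattern outside regex101 is Python's re.search, as in the sketch below. The sample values are taken from the question's data; the last two lines only illustrate what the lookahead is for, using a made-up input with no digits.

```python
import re

pattern = re.compile(r'(?=\d)[1-9]*0?$')

samples = ['0100179835', '1000000003', '0900000000', '1100000020']
results = [pattern.search(s).group() for s in samples]
print(results)  # ['179835', '3', '0', '20']

# Without the lookahead the pattern can succeed with an empty match
# at the very end of a string that contains no digits at all.
empty_match = re.search(r'[1-9]*0?$', 'abc').group()
no_match = pattern.search('abc')
print(repr(empty_match), no_match)  # '' None
```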

To get the desired result:
(?:^0*[1-9]+0*\K0|0\K[1-9]+(?:0[1-9]*|0+)?)$
Explanation
(?: Non capture group for the alternatives
^ Start of string
0*[1-9]+0* Match 1+ digits 1-9 between optional zeroes
\K0 Forget what is matched so far and then match a zero
| Or
0\K Match a zero and forget what is matched so far
[1-9]+ Match 1+ digits 1-9
(?: Non capture group for the alternatives
0[1-9]* Match a zero and optional digits 1-9
| Or
0+ Match 1+ zeroes
)? Close the non capture group
) Close the non capture group
$ End of string
See a regex demo.
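Python's built-in re module does not support \K, so the pattern above cannot be used there as written. A rough equivalent, assuming you are free to read capture groups instead of the whole match, replaces each \K with a group around whatever follows it. The helper name extract is made up for this sketch.

```python
import re

# Same alternatives as above, but each \K is replaced by a capture group:
# group 1 holds the final zero, group 2 holds the digits after a zero.
pattern = re.compile(r'(?:^0*[1-9]+0*(0)|0([1-9]+(?:0[1-9]*|0+)?))$')

def extract(number):
    """Return the text of whichever group took part in the match, or None."""
    m = pattern.search(number)
    if m is None:
        return None
    return m.group(1) if m.group(1) is not None else m.group(2)

samples = ['0100179835', '1508408637', '0900000000', '1000000003',
           '1100000020', '1000005000', '2906000000']
extracted = [extract(s) for s in samples]
print(extracted)
# ['179835', '8408637', '0', '3', '20', '5000', '6000000']
```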

Match one item per line:
'0123056'.match(/(?<=0)[1-9]*0?$/g).filter(m => m != '')
Match multiple items per line:
'0123056 0000210 1205000 1204566 0123456 0012340 0123400'.match(/(?<=0)[1-9]*0?\b/g).filter(m => m != '')

Try to split a string with particular regex expression

I'm trying to split a string using two separators and a regex. My string is, for example,
"test 10 20 middle 30 - 40 mm".
and I would like to split it into ["test 10", "20 middle 30", "40 mm"], that is, splitting on (and dropping) ' - ' and the space between two numbers.
I tried to do
result = re.split(r'[\d+] [\d+]', s)
> ['test 1', '0 middle 30 - 40 mm']
result2 = re.split(r' - |{\d+} {\d+}', s)
> ['test 10 20 middle 30', '40 mm']
Is there a regular expression that splits it into ['test 10', '20 middle 30', '40 mm']?

You may use
(?<=\d)\s+(?:-\s+)?(?=\d)
See the regex demo.
Details
(?<=\d) - a digit must appear immediately on the left
\s+ - 1+ whitespaces
(?:-\s+)? - an optional sequence of a - followed by 1+ whitespaces
(?=\d) - a digit must appear immediately on the right.
See the Python demo:
import re
text = "test 10 20 middle 30 - 40 mm"
print( re.split(r'(?<=\d)\s+(?:-\s+)?(?=\d)', text) )
# => ['test 10', '20 middle 30', '40 mm']

Data
k="test 10 20 middle 30 - 40 mm"
Please Try
result2 = re.split(r"(^[a-z]+\s\d+|\^d+\s[a-z]+|\d+)$",k)
result2
^[a-z]+ - match lowercase letters at the start of the string, greedily, followed by:
\s - a whitespace character
\d+ - digits, matched greedily
| - or match digits \d+ at the start of the string, also greedily, followed by:
\s - a whitespace character
[a-z]+ - lowercase letters, matched greedily
| - or match digits \d+ greedily, at the end of the string $

capture in a string everything that is not a token

Context: I am dealing with a mix of boolean and arithmetic expressions that may look like the following example:
b_1 /\ (0 <= x_1) /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))))
I want to match and extract any constraint of the shape A <= B contained in the formula which must be always true. In the above example, only 0 <= x_1 would satisfy such criterion.
Current Goal:
My idea is to build a simple parse tree of the input formula focusing only on the following tokens: and (/\), or (\/), left bracket (() and right bracket ()). Given the above formula, I would like to generate the following AST:
/\
|_ "b_1"
|_ /\
   |_ "0 <= x_1"
   |_ \/
      |_ "x_2 <= 2"
      |_ /\
         |_ "b_3"
         |_ "(/ 1 3) <= x_4"
Then, I can simply walk through the AST and discard any sub-tree rooted at \/.
My Attempt:
Looking at this documentation, I am defining the grammar for the lexer as follows:
import ply.lex as lex
tokens = (
    "LPAREN",
    "RPAREN",
    "AND",
    "OR",
    "STRING",
)
t_AND = r'\/\\'
t_OR = r'\\\/'
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_ignore = ' \t\n'
def t_error(t):
    print(t)
    print("Illegal character '{}'".format(t.value[0]))
    t.lexer.skip(1)
def t_STRING(t):
    r'^(?!\)|\(| |\t|\n|\\\/|\/\\)'
    t.value = t
    return t
data = "b_1 /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))"
lexer = lex.lex()
lexer.input(data)
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok.type, tok.value, tok.lineno, tok.lexpos)
However, I get the following output:
~$ python3 lex.py
LexToken(error,'b_1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,0)
Illegal character 'b'
LexToken(error,'_1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,1)
Illegal character '_'
LexToken(error,'1 /\\ (x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,2)
Illegal character '1'
AND /\ 1 4
LPAREN ( 1 7
LexToken(error,'x_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,8)
Illegal character 'x'
LexToken(error,'_2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,9)
Illegal character '_'
LexToken(error,'2 <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,10)
Illegal character '2'
LexToken(error,'<= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,12)
Illegal character '<'
LexToken(error,'= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,13)
Illegal character '='
LexToken(error,'2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))',1,15)
Illegal character '2'
OR \/ 1 17
LPAREN ( 1 20
LexToken(error,'b_3 /\\ ((/ 1 3) <= x_4))',1,21)
Illegal character 'b'
LexToken(error,'_3 /\\ ((/ 1 3) <= x_4))',1,22)
Illegal character '_'
LexToken(error,'3 /\\ ((/ 1 3) <= x_4))',1,23)
Illegal character '3'
AND /\ 1 25
LPAREN ( 1 28
LPAREN ( 1 29
LexToken(error,'/ 1 3) <= x_4))',1,30)
Illegal character '/'
LexToken(error,'1 3) <= x_4))',1,32)
Illegal character '1'
LexToken(error,'3) <= x_4))',1,34)
Illegal character '3'
RPAREN ) 1 35
LexToken(error,'<= x_4))',1,37)
Illegal character '<'
LexToken(error,'= x_4))',1,38)
Illegal character '='
LexToken(error,'x_4))',1,40)
Illegal character 'x'
LexToken(error,'_4))',1,41)
Illegal character '_'
LexToken(error,'4))',1,42)
Illegal character '4'
RPAREN ) 1 43
RPAREN ) 1 44
The t_STRING token is not recognized as it should be.
Question: how to set the catch all regular expression for t_STRING so as to get a working tokenizer?

Your regular expression for t_STRING most certainly doesn't do what you want. What it does do is a little more difficult to answer.
In principle, it consists only of two zero-length assertions: ^, which is only true at the beginning of the string (unless you provide the re.MULTILINE flag, which you don't), and a long negative lookahead assertion.
A pattern which consists only of zero-length assertions can only match the empty string, if it matches anything at all. But lexer patterns cannot be allowed to match the empty string. Lexers divide the input into a series of tokens, so that every character in the input belongs to some token. Each match -- and they are all matches, not searches -- starts precisely at the end of the previous match. So if a pattern could match the empty string, the lexer would try the next match at the same place, with the same result, which would be an endless loop.
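The point about zero-length assertions can be checked with plain re, independent of Ply. This is only an illustrative sketch; the sample strings are made up.

```python
import re

# ^ and (?!...) are both zero-width, so a successful match consumes nothing.
first = re.match(r'^(?!\))', 'b_1 /\\ (x_2 <= 2)')
print(repr(first.group()), first.span())  # '' (0, 0)

# A scanner that restarts at the end of each match would never advance here.
# re.finditer avoids the loop by stepping forward one position after each
# empty match; the characters themselves are never part of any match.
spans = [m.span() for m in re.finditer(r'(?!\))', 'ab')]
print(spans)  # [(0, 0), (1, 1), (2, 2)]
```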
Some lexer generators solve this problem by forcing a minimum one-character match using a built-in catch-all error pattern, but Ply simply refuses to generate a lexer if a pattern matches the empty string. Yet Ply does not complain about this lexer specification. The only possible explanation is that the pattern cannot match anything.
The key is that Ply compiles all patterns using the re.VERBOSE flag, which allows you to separate items in regular expressions with whitespace, making the regexes slightly less unreadable. As the Python documentation indicates:
Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>.
Whitespace includes newlines and even comments (starting with a # character), so you can split patterns over several lines and insert comments about each piece.
We could do that, in fact, with your pattern:
def t_STRING(t):
    r'''^ # Anchor this match at the beginning of the input
    (?! # Don't match if the next characters match:
      \) | # Close parenthesis
      \( | # Open parenthesis
      \ | # !!! HERE IS THE PROBLEM
      \t | # Tab character
      \n | # Newline character
      \\\/ | # \/ token
      \/\\ # /\ token
    )
    '''
    t.value = t
    return t
So as I added whitespace and comments to your pattern, I had to notice that the original pattern attempted to match a space character as an alternative with | |. But since the pattern is compiled as re.VERBOSE, that space character is ignored, leaving an empty alternative, which matches the empty string. That alternative is part of a negative lookahead assertion, which means that the assertion will fail if the string to match at that point starts with the empty string. Of course, every string starts with the empty string, so the negative lookahead assertion always fails, explaining why Ply didn't complain (and why the pattern never matches anything).
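The effect of the flag is easy to reproduce with the re module alone. The sketch below compiles the question's original pattern with and without re.VERBOSE; the names plain and verbose are just for illustration.

```python
import re

pattern = r'^(?!\)|\(| |\t|\n|\\\/|\/\\)'

plain = re.compile(pattern)
verbose = re.compile(pattern, re.VERBOSE)

# Without re.VERBOSE the space is a real alternative, so the lookahead
# only rejects the listed characters and 'b' gets an (empty) match.
print(plain.match('b_1'))    # a match object with span (0, 0)
print(plain.match(' x'))     # None, the space is rejected

# With re.VERBOSE the space is dropped, leaving an empty alternative
# that always matches, so the negative lookahead always fails.
print(verbose.match('b_1'))  # None
print(verbose.match(''))     # None
```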
Regardless of that particular glitch, the pattern cannot be useful because, as mentioned already, a lexer pattern must match some characters, and so a pattern which only matches the empty string cannot be useful. What we want to do is match any character, provided that the negative lookahead (corrected, as below) allows it. So that means that the negative lookahead assertion should be followed by ., which will match the next character.
But you almost certainly don't want to match just one character. Presumably you wanted to match a string of characters which don't match any other token. So that means putting the negative lookahead assertion and the following . into a repetition. And remember that it needs to be a non-empty repetition (+, not *), because patterns must not have empty matches.
Finally, there is absolutely no point using an anchor assertion, because that would limit the pattern to matching only at the beginning of the input, and that is certainly not what you want. It's not at all clear what it is doing there. (I've seen recommendations which suggest using an anchor with a negative lookahead search, which I think are generally misguided, but that discussion is out of scope for this question.)
And before we write the pattern, let's make one more adjustment: in a Python regular expression, if you can replace a set of alternatives with a character class, you should do so because it is a lot more efficient. That's true even if only some of the alternatives can be replaced.
So that produces the following:
def t_STRING(t):
    r'''(
    (?! # Don't match if the next characters match:
      [() \t\n] | # Parentheses or whitespace
      \\\/ | # \/ token
      \/\\ # /\ token
    ) . # If none of the above match, accept a character
    )+ # and repeat as many times as possible (at least once)
    '''
    return t
I removed t.value = t. t is a token object, not a string, and the value should be the string it matched. If you overwrite the value with a circular reference, you won't be able to figure out which string was matched.
This works, but not quite in the way you intended. Since whitespace characters are excluded from t_STRING, you don't get a single token representing (/ 1 3) <= x_4. Instead, you get a series of tokens:
STRING b_1 1 0
AND /\ 1 4
LPAREN ( 1 7
STRING x_2 1 8
STRING <= 1 12
STRING 2 1 15
OR \/ 1 17
LPAREN ( 1 20
STRING b_3 1 21
AND /\ 1 25
LPAREN ( 1 28
LPAREN ( 1 29
STRING / 1 30
STRING 1 1 32
STRING 3 1 34
RPAREN ) 1 35
STRING <= 1 37
STRING x_4 1 40
RPAREN ) 1 43
RPAREN ) 1 44
But I think that's reasonable. How could the lexer be able to tell that the parentheses in (x_2 <= 2 and (b_3 are parenthesis tokens, while the parentheses in (/ 1 3) <= x_4 are part of t_STRING? That determination will need to be made in your parser.
In fact, my inclination would be to fully tokenise the input, even if you don't (yet) require a complete tokenisation. As this entire question and answer shows, attempting to recognise "everything but..." can actually be a lot more complicated than just recognising all tokens. Trying to get the tokeniser to figure out which tokens are useful and which ones aren't is often more difficult than tokenising everything and passing it through a parser.
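If you want to experiment with the corrected STRING pattern without installing Ply, a rough stand-in is a single combined regex with named groups. This is a sketch under that assumption, not how Ply works internally; TOKEN_RE, the SKIP group and tokenize are names invented here.

```python
import re

# Assumption: one combined pattern stands in for Ply's rule table.
TOKEN_RE = re.compile(r'''
    (?P<AND>    /\\ )      # the "and" token
  | (?P<OR>     \\/ )      # the "or" token
  | (?P<LPAREN> \(  )
  | (?P<RPAREN> \)  )
  | (?P<STRING> (?: (?! [() \t\n] | \\/ | /\\ ) . )+ )
  | (?P<SKIP>   [ \t\n]+ ) # ignored, like t_ignore
''', re.VERBOSE)

def tokenize(text):
    """Yield (type, value) pairs, matching each token where the last one ended."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f'illegal character {text[pos]!r} at {pos}')
        pos = m.end()
        if m.lastgroup != 'SKIP':
            yield m.lastgroup, m.group()

data = r'b_1 /\ (x_2 <= 2 \/ (b_3 /\ ((/ 1 3) <= x_4))'
tokens = list(tokenize(data))
for kind, value in tokens:
    print(kind, value)
```

The token sequence it prints has the same types and values as the Ply output listed above, which makes it convenient for trying out changes to the STRING pattern.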

Based on the excellent answer from @rici, which pointed out the problem with t_STRING, this is my final version of the example. It makes a few small changes to the one proposed by @rici.
Code
##############
# TOKENIZING #
##############
tokens = (
    "LPAREN",
    "RPAREN",
    "AND",
    "OR",
    "STRING",
)
def t_AND(t):
    r'[ ]*\/\\[ ]*'
    t.value = "/\\"
    return t
def t_OR(t):
    r'[ ]*\\\/[ ]*'
    t.value = "\\/"
    return t
def t_LPAREN(t):
    r'[ ]*\([ ]*'
    t.value = "("
    return t
def t_RPAREN(t):
    r'[ ]*\)[ ]*'
    t.value = ")"
    return t
def t_STRING(t):
    r'''(
    (?! # Don't match if the next characters match:
      [()\t\n] | # Parentheses, tab or newline (spaces are allowed)
      \\\/ | # \/ token
      \/\\ # /\ token
    ) . # If none of the above match, accept a character
    )+ # and repeat as many times as possible (at least once)
    '''
    return t
def t_error(t):
    print("error: " + str(t.value[0]))
    t.lexer.skip(1)
import ply.lex as lex
lexer = lex.lex()
data = "b_b /\\ (ccc <= 2 \\/ (b_3 /\\ ((/ 1 3) <= x_4))"
lexer.input(data)
while True:
    tok = lexer.token()
    if not tok:
        break
    print("{0}: `{1}`".format(tok.type, tok.value))
Output
STRING: `b_b `
AND: `/\`
LPAREN: `(`
STRING: `ccc <= 2 `
OR: `\/`
LPAREN: `(`
STRING: `b_3 `
AND: `/\`
LPAREN: `(`
LPAREN: `(`
STRING: `/ 1 3`
RPAREN: `)`
STRING: `<= x_4`
RPAREN: `)`
RPAREN: `)`

c# Regex mask for textbox decimal precision 1 min value 1 max value 100

Does anyone have a regex mask for a textbox that allows a decimal precision of 1, with a min value of 1 and a max value of 100?
Values that need to pass:
0,5
0,1
10,5
99,5
100
Basically, every value between 0,1 and 100.

Give this pattern a try
\d{0,3},?\d*
Pattern breakdown:
\d{0,3} - 0 to 3 digits
,? - 0 to 1 comma
\d* - 0 or more digits
Tested at Regex101

To match every value between 0,1 and 100 with a decimal precision of 1, you could use an alternation that matches one of: 1 - 99 with an optional single decimal digit 0-9; a 0 with a single decimal digit 1-9, so that 0,0 does not match; or 100 with an optional ,0.
^(?:[1-9][0-9]?(?:,[0-9])?|0,[1-9]|(?:100(?:,0)?))$
Explanation
^ Assert start of the line
(?: Non capturing group
[1-9][0-9]?(?:,[0-9])? Match 1 - 99 followed by an optional comma and digit 0-9
| Or
0,[1-9] Match a zero and a comma followed by a digit 1-9 so 0,0 does not match
| Or
(?:100(?:,0)?) Match 100 with an optional comma and 0
) Close non capturing group
$ Assert end of the line
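The question is about C#, but the pattern uses no engine-specific syntax, so a quick sanity check can be sketched in Python. The accepted list contains the values from the question plus two extras; the rejected list is made up for this example.

```python
import re

pattern = re.compile(r'^(?:[1-9][0-9]?(?:,[0-9])?|0,[1-9]|(?:100(?:,0)?))$')

accepted = ['0,5', '0,1', '10,5', '99,5', '100', '100,0', '1']
rejected = ['0', '0,0', '100,5', '101', '10,55', ',5']

passed = [s for s in accepted if pattern.match(s)]
leaked = [s for s in rejected if pattern.match(s)]

print(passed)  # all seven accepted values
print(leaked)  # []
```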

How to select the sequence of digits from string in Firebird 2.5?

I have strings like:
26.05.2016/00002Lol
26.05.2016/00003(Lol)
26.05.2016UUUU/00004(Lol)
How to select the sequence of five digits (00002, 00003, 00004) from these strings?

Are those five digits always after the first '/'? If so, then:
SELECT SUBSTRING(col FROM POSITION('/', col) + 1 FOR 5) AS fivedigits ...
ought to do the trick.

Description
(?<=\/)[0-9]{5}
This regular expression will do the following:
match the 5 digits that immediately follow a / character
Example
Live Demo
https://regex101.com/r/lD6pW5/1
Sample text
26.05.2016/00002Lol
26.05.2016/00003(Lol)
26.05.2016UUUU/00004(Lol)
Sample Matches
[0][0] = 00002
[1][0] = 00003
[2][0] = 00004
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
(?<= look behind to see if there is:
----------------------------------------------------------------------
\/ '/'
----------------------------------------------------------------------
) end of look-behind
----------------------------------------------------------------------
[0-9]{5} any character of: '0' to '9' (5 times)
----------------------------------------------------------------------
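For comparison, both approaches from this thread can be tried side by side in Python. This only checks the logic on the sample strings; it says nothing about which regex features Firebird 2.5 itself supports. The variable names are made up.

```python
import re

rows = ['26.05.2016/00002Lol', '26.05.2016/00003(Lol)', '26.05.2016UUUU/00004(Lol)']

# Lookbehind approach: five digits immediately preceded by '/'
by_regex = [re.search(r'(?<=/)[0-9]{5}', s).group() for s in rows]

# Substring approach: the five characters after the first '/'
by_position = [s[s.index('/') + 1:][:5] for s in rows]

print(by_regex)     # ['00002', '00003', '00004']
print(by_position)  # ['00002', '00003', '00004']
```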
