Parsing digits and decimals out of string with re

I have a string that looks like this:
'Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'
I need to parse the last set of numbers, the ones between the last period and the closing paren (in this case, 241384081), out of the string, keeping in mind that there may be one or more sets of parentheses in the filename "yada_yada.mov".
So far I have this:
mo = re.match('.*([0-9])\)$', data1)
...where data1 is the string. But that is only returning the very last digit.
Any help, please?
Thanks!

You may use
(\d[\d.]*)\)$
See the regex demo.
Details
(\d[\d.]*) - Capturing group 1: a digit and then any amount of . and digits, 0 or more times
\) - a )
$ - end of string.
See the Python demo:
import re
s='Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'
m = re.search(r'(\d[\d.]*)\)$', s)
if m:
    print(m.group(1)) # => 22.4338.241384081
    # print(m.group(1).replace(".", "")) # => 224338241384081
Alternative patterns:
(\d+(?:\.\d+)*)\)$ # To match digits and then 0 or more repetitions of . + digits
(\d+(?:\.\d+)*)\)\s*$ # To allow any 0+ trailing whitespaces
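The question asks only for the final run of digits (241384081), while the patterns above capture the whole dotted number. Here is a minimal sketch of one way to get just the last segment, assuming the string always ends with digits followed by a closing parenthesis. The helper name last_number is hypothetical and not part of the original answer.

```python
import re

def last_number(text):
    """Return the digits between the last period and the closing paren, or None."""
    # \.(\d+) : a literal dot, then capture the digits that follow it
    # \)\s*$  : the closing paren, optional trailing whitespace, end of string
    m = re.search(r'\.(\d+)\)\s*$', text)
    return m.group(1) if m else None

s = 'Home Cookie viewed item "yada_yada.mov" (22.4338.241384081)'
print(last_number(s))  # 241384081

# Parentheses inside the filename do not interfere, because the pattern
# is anchored at the end of the string.
t = 'Home Cookie viewed item "yada (1).mov" (22.4338.241384081)'
print(last_number(t))  # 241384081
```

The attempt in the question returned a single digit because [0-9] matches exactly one character and the greedy .* in front of it consumes everything else.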

Related

Regex: match a character if it is followed by a string containing the same character

I am trying to match the first '.' character in '-2.232.232'.
I am close with this regex:
^[^\.]*(\.)(?=.*\.)
but it matches '-2.' instead of '.'.
Thank you very much.

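You can use Series.str.replace with the n argument set to 1. In pandas 2.0 and later, regex=True must be passed explicitly, because the pattern is otherwise treated as a literal string:
import pandas as pd
df = pd.DataFrame({'Data':['-2.232.232']})
df['Data'].str.replace(r"\.(?=[^.]*\.)", "", n=1, regex=True)
# => 0 -2232.232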
Here,
\.(?=[^.]*\.) - matches a dot that is followed with any zero or more chars other than a dot and then a dot char
n=1 - sets the number of replacements. n=1 means only one replacement.
Alternatively, you may use
>>> df['Data'].str.replace(r"^([^.]*)\.(?=[^.]*\.)", r"\1", n=1, regex=True)
0 -2232.232
Here, ^([^.]*) matches and captures into Group 1 any zero or more chars other than . from the start of the string, and the \1 refers to that value from the replacement pattern.
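For a single string outside pandas, the same idea can be sketched with the standard re module, where count=1 plays the role of n=1. The snippet also shows why the attempt in the question appeared to match '-2.': the full match includes the prefix, while group 1 holds only the dot. The variable names are illustrative.

```python
import re

value = '-2.232.232'

# Remove only the first dot that still has another dot somewhere after it.
result = re.sub(r'\.(?=[^.]*\.)', '', value, count=1)
print(result)  # -2232.232

# A string with a single dot is left unchanged, because the lookahead
# requires a later dot.
single = re.sub(r'\.(?=[^.]*\.)', '', '-2.232', count=1)
print(single)  # -2.232

# The pattern from the question: group(0) is the whole match,
# group(1) is just the captured dot.
m = re.search(r'^[^.]*(\.)(?=.*\.)', value)
print(m.group(0))  # -2.
print(m.group(1))  # .
```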

OR between groups when one group has to be preceded by a character

I have the following data:
$200 – $4,500
Points – $2,500
I would like to capture the ranges in dollars, or capture the Points string if that is the lower range.
For example, if I ran my regex on each of the entries above I would expect:
Group 1: 200
Group 2: 4,500
and
Group 1: Points
Group 2: 2,500
For the first group, I can't figure out how to capture only the integer value (without the $ sign) while allowing for capturing Points.
Here is what I tried:
(?:\$([0-9,]+)|Points) – \$([0-9,]+)
https://regex101.com/r/mD9JeR/1

Just use an alternation here:
^(?:(Points)|\$(\d{1,3}(?:,\d{3})*)) – \$(\d{1,3}(?:,\d{3})*)$
Demo
The salient points of the above regex pattern are that we use an alternation to match either Points or a dollar amount on the lower end of the range, and we use the following regex for matching a dollar amount with commas:
\$\d{1,3}(?:,\d{3})*
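Note that this alternation produces three capture groups (Points, the lower dollar amount, the upper dollar amount), not the two the question asks for. Below is a minimal sketch of how the lower bound could be merged in Python. The function name parse_range is hypothetical.

```python
import re

# Same pattern as above, written with the en dash that appears in the sample data.
PATTERN = re.compile(r'^(?:(Points)|\$(\d{1,3}(?:,\d{3})*)) – \$(\d{1,3}(?:,\d{3})*)$')

def parse_range(text):
    """Return (lower, upper) as strings, or None if the text is not a range."""
    m = PATTERN.match(text)
    if m is None:
        return None
    # Exactly one of groups 1 and 2 participates in a successful match.
    lower = m.group(1) or m.group(2)
    return lower, m.group(3)

print(parse_range('$200 – $4,500'))    # ('200', '4,500')
print(parse_range('Points – $2,500'))  # ('Points', '2,500')
```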

Coming up with a regex that doesn't match the $ is not difficult. What is not straightforward is a regex that doesn't match the $ and still consistently puts the two values, whether both are numeric or one of them is Points, in capture groups 1 and 2. The difficulty disappears if you use named capture groups. This regex requires the regex module from the PyPI repository, since it uses the same group name more than once, which the standard re module does not allow.
import regex
tests = [
    '$200 – $4,500',
    'Points – $2,500'
]
pattern = r"""(?x)    # verbose mode
^                     # start of string
(
    \$                # first alternate choice
    (?P<G1>[\d,]+)    # named group G1
    |                 # or
    (?P<G1>Points)    # second alternate choice
)
\x20–\x20             # ' – '
\$
(?P<G2>[\d,]+)        # named group G2
$                     # end of string
"""
# or pattern = r'^(\$(?P<G1>[\d,]+)|(?P<G1>Points)) – \$(?P<G2>[\d,]+)$'
for test in tests:
    m = regex.match(pattern, test)
    print(m.group('G1'), m.group('G2'))
Prints:
200 4,500
Points 2,500
UPDATE
@marianc was on the right track with his comment but did not ensure that there were no extraneous characters in the input. So, with his useful input:
import re
tests = [
'$200 – $4,500',
'Points – $2,500',
'xPoints – $2,500',
]
rex = r'((?<=^\$)\d{1,3}(?:,\d{3})*|(?<=^)Points) – \$(\d{1,3}(?:,\d{3})*)$'
for test in tests:
    m = re.search(rex, test)
    if m:
        print(test, '->', m.groups())
    else:
        print(test, '->', 'No match')
Prints:
$200 – $4,500 -> ('200', '4,500')
Points – $2,500 -> ('Points', '2,500')
xPoints – $2,500 -> No match
Note that re.search is used rather than re.match: re.match only tries position 0, and the lookbehind (?<=^\$) needs the $ to sit before the match position, so a dollar range can never match there. We still enforce no extraneous characters at the start of the line by including the ^ anchor inside the lookbehind assertions.
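The difference between the two calls can be checked directly. This short sketch reuses the pattern from this update. The variable names dollars and points are illustrative.

```python
import re

rex = r'((?<=^\$)\d{1,3}(?:,\d{3})*|(?<=^)Points) – \$(\d{1,3}(?:,\d{3})*)$'

dollars = '$200 – $4,500'
points = 'Points – $2,500'

# At position 0 there is no character before the match, so (?<=^\$) fails.
print(re.match(rex, dollars))            # None

# re.search can start at position 1, right after the $.
print(re.search(rex, dollars).groups())  # ('200', '4,500')

# The Points branch uses a zero-width lookbehind, so it works at position 0.
print(re.match(rex, points).groups())    # ('Points', '2,500')
```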

For the first capturing group, you could use an alternation that either matches Points, asserting with a negative lookbehind that there is no non-whitespace char directly to the left, or matches the digits with optional comma-separated groups of three digits, asserting with a positive lookbehind (if that is supported) that there is a dollar sign directly to the left.
For the second capturing group, there is no alternative, so you can match the dollar sign and capture the digits with optional comma-separated groups of three digits in group 2.
((?<=\$)\d{1,3}(?:,\d{3})*|(?<!\S)Points) – \$(\d{1,3}(?:,\d{3})*)
Explanation
( Capture group 1
(?<=\$)\d{1,3}(?:,\d{3})* Positive lookbehind, assert a $ to the left and match 1-3 digits and repeat 0+ matching a comma and 3 digits
| Or
(?<!\S)Points Negative lookbehind, assert there is no non-whitespace char directly to the left and match Points
) Close group 1
– Match literally
\$ Match $
( Capture group 2
\d{1,3}(?:,\d{3})* Match 1-3 digits and 0+ times a comma and 3 digits
) Close group
Regex demo
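Because this pattern is not anchored to the start or end of the string, it can also pick ranges out of longer text. The sketch below uses Python's re.findall. The input line is a made-up example, not data from the question.

```python
import re

# Pattern from the answer above, compiled once for reuse.
RANGE = re.compile(r'((?<=\$)\d{1,3}(?:,\d{3})*|(?<!\S)Points) – \$(\d{1,3}(?:,\d{3})*)')

# Hypothetical input that mixes both kinds of range in one line.
text = 'Tier A: $200 – $4,500; Tier B: Points – $2,500; Tier C: xPoints – $9,000'

pairs = RANGE.findall(text)
for low, high in pairs:
    print(low, high)
# 200 4,500
# Points 2,500
```

The xPoints entry is skipped because the negative lookbehind rejects Points when a non-whitespace character sits directly before it.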

Regex to parse a name of one or more words after a decimal number and before 2 or more spaces

Problem:
How can I create a regex to parse a name like "DISNAY LAND 2.0 GCP" from an array of lines in Scala such as this one:
DE1ALAT0002  32.4756  -86.4393  106.1  ZQ  DISNAY LAND 2.0 GCP  23456
// For use in code:
val regex = """(?:[\d\.\d]){2}\s*(?:[\d.\d])\s*(ZQ)\s*([A-Z])""".r // my attempt
val getName = row match {
  case regex(name) => name
  case _ =>
}
I am only sure that:
1) the number of spaces between values varies
2) the useful value "DISNAY LAND 2.0 GCP" comes after a decimal number and the letters "ZQ"
3) the words of the name are separated by a single space, and the name may consist of one or more words
4) the name ends with two or more spaces
Sorry if this repeats an existing question, but after a long search I did not find the right solution.
Many thanks for any answers.

You may use an .unanchored pattern like
\d\.\d+\s+ZQ\s+(\S+(?:\s\S+)*)
See the regex demo. Details
\d\.\d+ - 1 digit, . and then 1+ digits
\s+ - 1+ whitespaces
ZQ - ZQ substring
\s+ - 1+ whitespaces (here, the left-hand side context definition ends, now, starting to capture the value we need to return)
(\S+(?:\s\S+)*) - Capturing group 1:
\S+ - 1 or more non-whitespace chars
(?:\s\S+)* - a non-capturing group that matches 0 or more sequences of a single whitespace (\s) and then 1+ non-whitespace chars (so, up to the double whitespace or end of string).
Scala demo:
val regex = """\d\.\d+\s+ZQ\s+(\S+(?:\s\S+)*)""".r.unanchored
// Fields are separated by two or more spaces; words inside the name by one
val row = "DE1ALAT0002  32.4756  -86.4393  106.1  ZQ  DISNAY LAND 2.0 GCP  23456"
val getName = row match {
  case regex(name) => name
  case _ => "" // no match
}
print(getName)
Output: DISNAY LAND 2.0 GCP
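For comparison, here is a rough Python equivalent of the same pattern. re.search scans the whole line, which corresponds to .unanchored in Scala. The sample row assumes fields are separated by two or more spaces, as the question describes.

```python
import re

# Same pattern as the Scala version.
NAME = re.compile(r'\d\.\d+\s+ZQ\s+(\S+(?:\s\S+)*)')

# Assumed sample row: two or more spaces between fields, one inside the name.
row = 'DE1ALAT0002  32.4756  -86.4393   106.1  ZQ  DISNAY LAND 2.0 GCP   23456'

m = NAME.search(row)
name = m.group(1) if m else None
print(name)  # DISNAY LAND 2.0 GCP
```

The capture stops at the first run of two spaces because (?:\s\S+)* only accepts a single whitespace character before each further word.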

Regex pattern matching in python

I am trying to split the data
rest = [" hgod eruehf 10 SECTION 1. DATA: find my book 2.11.111 COLUMN: get me tea","111.2 CONTAIN i am good"]
match = re.compile(r'(((\d[.])(\d[.]))+\s(\w[A-Z]+:|\w+))')
out = match.search(rest)
print(out.group(0))
The pattern I found is: a dotted number (e.g. 1. / 1.1. / 1.21.1) followed by text, up to the next dotted number (e.g. 1. / 1.1. / 1.21.1).
I want to split the data as
DATA: find my book
2.11.111 COLUMN: get me tea
111.2 CONTAIN i am good
Is there any way to split the text data based on this pattern?

You may get the expected matches using
import re
rest = [" hgod eruehf 10 SECTION 1. DATA: find my book 2.11.111 COLUMN: get me tea","111.2 CONTAIN i am good"]
res = []
for s in rest:
    res.extend(re.findall(r'\d+(?=\.)(?:\.\d+)*.*?(?=\s*\d+(?=\.)(?:\.\d+)*|\Z)', s))
print(res)
# => ['1. DATA: find my book', '2.11.111 COLUMN: get me tea', '111.2 CONTAIN i am good']
See the Python demo
The regex is applied to each item in the rest list and all matches are saved into res list.
Pattern details
\d+ - 1+ digits
(?=\.) - there must be a . immediately to the right of the current position
(?:\.\d+)* - 0 or more repetitions of a . and then 1+ digits
.*? - 0+ chars other than newline, as few as possible
(?=\s*\d+(?=\.)(?:\.\d+)*|\Z) - up to the 0+ whitespaces, 1+ digits with a . immediately to the right of the current position, 0 or more repetitions of a . and then 1+ digits, or end of string
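To see what the number part of the pattern matches on its own, the sketch below runs only that sub-pattern over the sample strings. The name MARKER is hypothetical. A plain number such as 10 is skipped because no dot follows it.

```python
import re

# The "dotted number" part of the pattern above, isolated.
MARKER = r'\d+(?=\.)(?:\.\d+)*'

first = ' hgod eruehf 10 SECTION 1. DATA: find my book 2.11.111 COLUMN: get me tea'
second = '111.2 CONTAIN i am good'

first_markers = re.findall(MARKER, first)
second_markers = re.findall(MARKER, second)
print(first_markers)   # ['1', '2.11.111']
print(second_markers)  # ['111.2']
```

The lookahead (?=\.) checks for the dot without consuming it, which is why '1.' is reported as '1' here. In the full pattern the dot is then picked up by .*?.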

Select if there is no delimiter, and select nothing if there is

I have a string like "smth 2sg. smth", and sometimes "smth 2sg.| smth.".
What pattern should I use to select "2sg." if the string does not contain "|", and to select nothing if the string does contain "|"?

I have 2 methods. They both use a negative lookahead, which is written like so:
(?!data)
When this is inserted into a regex, the match fails at that position if data comes immediately next. The lookahead itself consumes no characters.
Method 1 (shorter)
Just capture 2sg.
Try this RegEx:
(\dsg\.)(?!\|)
Use (\d+... if the number could be longer than 1 digit
How it works:
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if immediately followed by |
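Here is a minimal sketch of Method 1 in Python, run against the two sample strings from the question. The helper name select is hypothetical.

```python
import re

PATTERN = re.compile(r'(\dsg\.)(?!\|)')

def select(text):
    """Return the captured '<digit>sg.' token, or None when a | follows it."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

print(select('smth 2sg. smth'))    # 2sg.
print(select('smth 2sg.| smth.'))  # None
```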
Method 2 (longer but safer)
Match the whole string and capture 2sg.
Try this RegEx:
^\w+\s*(\dsg\.)(?!\|)\s*\w+\.?$
Use (\d+sg... if the number could be longer than 1 digit
How it works:
^ # String starts with ...
\w+\s* # Letters then Optional Whitespace (smth )
( # To capture (2sg.)
\d # Digit (2)
sg # (sg)
\. # . (Dot)
)
(?!\|) # Do not match if immediately followed by |
\s* # Optional Whitespace
\w+ # Letters (smth)
\.? # Optional . (Dot)
$ # ... Strings ends with

Something like this might work for you:
(\d*sg\.)(?!\|)
It matches an optional number followed by sg., provided the match is not immediately followed by |.

^.*(\dsg\.)[^\|]*$
Explanation:
^ : starts from the beginning of the string
.* : accepts any number of initial characters (even nothing)
(\dsg\.) : looks for the group of digit + "sg."
[^\|]* : considers any number of following characters except for |
$ : stops at the end of the string
You can now select your string by getting the first group from your regex
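The lookahead answers and this whole-string pattern behave differently when the | is not directly after the dot. The sketch below compares them in Python. The third pattern, strict, is an assumed variant (not from any answer) that also rejects a | before the token.

```python
import re

lookahead = re.compile(r'(\dsg\.)(?!\|)')     # only checks the next character
whole = re.compile(r'^.*(\dsg\.)[^|]*$')      # rejects any | after the token
strict = re.compile(r'^[^|]*(\dsg\.)[^|]*$')  # assumed variant: no | anywhere

def hits(text):
    """Return which of the three patterns find a match in text."""
    return tuple(bool(p.search(text)) for p in (lookahead, whole, strict))

print(hits('smth 2sg. smth'))    # (True, True, True)
print(hits('smth 2sg.| smth.'))  # (False, False, False)
print(hits('smth 2sg. |smth'))   # (True, False, False)
print(hits('sm|th 2sg. smth'))   # (True, True, False)
```

Which behavior is wanted depends on whether "contains |" in the question means anywhere in the string or only directly after 2sg.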

Try:
(\d+sg\.(?!\|))
Depending on your programming environment the syntax can be a little bit different, but this will get your result.
For more information see Negative Lookahead
