Regex get certain information - regex

I've run into trouble with certain types of rows.
I parse this information:
195/75 R 16 C X Wonder Van 110/108R 10PR Tourador
The groups, which I need
I've got the following regex
([0-9]+)?\/([0-9]+)\s*\w\s*([0-9]+(?:\.\d+)?)\s*(C\s+)?(.+\s+?(?=[0-9]{2,3}|(\d{2,3}\/\d{2,3})))(?:(\d{2,3}\/\d{2,3})|(\d{2,3}))\s*(\w)(.*)
It works nicely for all kinds of rows, e.g.
225/55 R18 X Speed TU1 98V Toradfor
225/50 R 16 X Wonder TH1 96W XL Tourador
195/75 R 16 C X Wonder Van 110/108R 8PR Tourador
However, it doesn't work for
195/75 R 16 C X Wonder Van 110/108R 10PR Tourador
because of 10PR, where 10 consists of 2 digits
how it works now
Thank you!

In your pattern you use alternations | that can match and capture unrelated parts of the strings.
What you could do is use anchors and an optional capture group.
For all the given example strings you might use:
^(\d+)\/(\d+)\s+[A-Z]*\s*(\d+)\s*([A-Z])(.*?)(\d+\/\d+([A-Z]+))?\s+(\d+[A-Z]+\s+.*)$
The pattern in parts:
^ Start of string
(\d+)\/(\d+)\s+ Capture 1+ digits in group 1, match /, capture 1+ digits in group 2 and match 1+ whitespace chars
[A-Z]*\s* Match optional chars A-Z and optional whitespace chars
(\d+)\s* Capture 1+ digits in a group and match optional whitespace chars
([A-Z]) Capture a single char A-Z in a group
(.*?) Capture as few as possible chars in a group
( Capture group
\d+\/\d+ Match 1+ digits / and 1+ digits
([A-Z]+) Capture 1+ chars A-Z
)? Close the capture group and make it optional
\s+ Match 1+ whitespace chars
(\d+[A-Z]+\s+.*) Capture group, match 1+ digits, 1+ chars A-Z, 1+ whitespace chars and the rest of the line
$ End of string
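As a quick check, the pattern above can be run against the rows from the question. This is a minimal Python sketch; the names TYRE_RE and parse_row are hypothetical helpers introduced only for illustration.

```python
import re

# Pattern copied from the answer above.
TYRE_RE = re.compile(
    r"^(\d+)\/(\d+)\s+[A-Z]*\s*(\d+)\s*([A-Z])(.*?)(\d+\/\d+([A-Z]+))?\s+(\d+[A-Z]+\s+.*)$"
)

# Rows taken from the question, including the one with the 2-digit 10PR.
rows = [
    "225/55 R18 X Speed TU1 98V Toradfor",
    "225/50 R 16 X Wonder TH1 96W XL Tourador",
    "195/75 R 16 C X Wonder Van 110/108R 8PR Tourador",
    "195/75 R 16 C X Wonder Van 110/108R 10PR Tourador",
]


def parse_row(row):
    """Return the tuple of capture groups, or None if the row does not match."""
    m = TYRE_RE.match(row)
    return m.groups() if m else None


for row in rows:
    print(parse_row(row))
```

Groups 6 and 7 come back as None for rows without a part like 110/108R, and group 5 keeps its surrounding spaces, so it may need a strip() before further use.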

Related

Regex match specific strings

I want to capture all the strings from multi-line data. Below is the expected result, along with my pattern, which does not work.
Pattern: ^XYZ/[0-9|ALL|P] I’m lost with this part. Can anyone help?
Result
XYZ/1
XYZ/1,2-5
XYZ/5,7,8-9
XYZ/2-4,6-8,9
XYZ/ALL
XYZ/P1
XYZ/P2,3
XYZ/P4,5-7
XYZ/P1-4,5-7,8-9
Changed to
XYZ/1
XYZ/1,2-5
XYZ/5,7,8-9
XYZ/2-4,6-8,9
XYZ/A12345 after the slash limited to 6 alphanumeric chars
XYZ/LH-1234567890 after the /LH- limited to 10 numeric chars
The pattern could be:
^XYZ\/(?:ALL|P?[0-9]+(?:-[0-9]+)?(?:,[0-9]+(?:-[0-9]+)?)*)$
The pattern in parts matches:
^ Start of string
XYZ\/ Match XYZ/ (You don't have to escape the / depending on the pattern delimiters)
(?: Outer non capture group for the alternatives
ALL Match literally
| Or
P? Match an optional P
[0-9]+(?:-[0-9]+)? Match 1+ digits with an optional - and 1+ digits
(?: Non capture group to match as a whole
,[0-9]+(?:-[0-9]+)? Match , and 1+ digits and an optional - and 1+ digits
)* Close the non capture group and optionally repeat it
) Close the outer non capture group
$ End of string
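The pattern can be tried on the multi-line data from the question. This is a minimal Python sketch; the last three lines of the sample data are made-up inputs added to show what gets rejected, and re.MULTILINE is used so that ^ and $ apply per line.

```python
import re

# Pattern copied from the answer above; MULTILINE makes ^ and $ match per line.
XYZ_RE = re.compile(
    r"^XYZ\/(?:ALL|P?[0-9]+(?:-[0-9]+)?(?:,[0-9]+(?:-[0-9]+)?)*)$",
    re.MULTILINE,
)

data = """XYZ/1
XYZ/1,2-5
XYZ/5,7,8-9
XYZ/2-4,6-8,9
XYZ/ALL
XYZ/P1
XYZ/P2,3
XYZ/P4,5-7
XYZ/P1-4,5-7,8-9
XYZ/
XYZ/1,,2
XYZ/ALL1"""

# No capture groups in the pattern, so findall returns the whole matches.
matches = XYZ_RE.findall(data)
print(matches)
```

Only the first nine lines are returned; the three made-up lines at the end are not matched.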
You can use this regex pattern to match those lines
^XYZ\/(?:P|ALL|[0-9])[0-9,-]*$
Use the global g and multiline m flags.
Btw, [P|ALL] doesn't match the word "ALL".
It only matches a single character that's a P or A or L or |.
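The difference between a character class and a group with alternation can be seen in a small Python sketch. The sample string and the name LOOSE_RE are made up for illustration.

```python
import re

# A character class matches exactly one character out of the listed set.
print(re.findall(r"[P|ALL]", "ALL|P"))           # ['A', 'L', 'L', '|', 'P']
print(re.fullmatch(r"[P|ALL]", "ALL"))           # None: three chars, not one
print(bool(re.fullmatch(r"(?:P|ALL)", "ALL")))   # True: alternation in a group

# The shorter pattern from this answer, applied line by line.
LOOSE_RE = re.compile(r"^XYZ\/(?:P|ALL|[0-9])[0-9,-]*$", re.MULTILINE)
sample = "XYZ/1,2-5\nXYZ/ALL\nXYZ/P4,5-7\nXYZ/1,,2\nABC/1"
loose_matches = LOOSE_RE.findall(sample)
print(loose_matches)
```

Because [0-9,-]* accepts any mix of digits, commas and hyphens, this pattern also accepts a line such as XYZ/1,,2. Whether that is acceptable depends on how clean the input is.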

How can I extract non digit characters and digit characters in the end of a string?

I have a string that has the following structure:
digit-word(s)-digit.
For example:
2029 AG.IZTAPALAPA 2
I want to extract the word(s) in the middle, and the digit at the end of the string.
I want to extract AG.IZTAPALAPA and 2 in the same capture group to extract like:
AG.IZTAPALAPA 2
I managed to capture them as individual capture groups but not as a single:
town_state['municipality'] = town_state['Town'].str.extract(r'(\D+)', expand=False)
town_state['number'] = town_state['Town'].str.extract(r'(\d+)$', expand=False)
Thank you for your help!
You can use a single capturing group for the example string. It matches a single "word" that consists of uppercase chars A-Z, with optional dots in the middle that cannot be at the start or end, followed by a space and 1 or more digits.
\b\d+ ([A-Z]+(?:\.[A-Z]+)* \d+)\b
Explanation
\b A word boundary
\d+ Match 1+ digits followed by a space
( Capture group 1
[A-Z]+ Match 1+ occurrences of an uppercase char A-Z
(?:\.[A-Z]+)* \d+ Repeat 0+ times matching a dot and 1+ chars A-Z, then match a space and 1+ digits
) Close group 1
\b A word boundary
Or you can make the pattern a bit broader matching either a dot or a word character
\b\d+ ([\w.]+(?: [\w.]+)* \d+)\b
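Both patterns can be compared with plain re before passing them to pandas. This is a minimal sketch; STRICT_RE, BROAD_RE and extract are hypothetical names, and "2029 SAN JUAN 12" is a made-up input with a multi-word name.

```python
import re

# The two patterns from this answer.
STRICT_RE = re.compile(r"\b\d+ ([A-Z]+(?:\.[A-Z]+)* \d+)\b")
BROAD_RE = re.compile(r"\b\d+ ([\w.]+(?: [\w.]+)* \d+)\b")


def extract(pattern, text):
    """Return capture group 1 (name plus trailing number), or None."""
    m = pattern.search(text)
    return m.group(1) if m else None


print(extract(STRICT_RE, "2029 AG.IZTAPALAPA 2"))  # AG.IZTAPALAPA 2
print(extract(STRICT_RE, "2029 SAN JUAN 12"))      # None: spaces inside the name
print(extract(BROAD_RE, "2029 SAN JUAN 12"))       # SAN JUAN 12
```

Since each pattern has a single capture group, the same pattern string should also work with str.extract(..., expand=False) as used in the question.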
You can use the following simple regex:
[0-9]+\s([A-Z]+.[A-Z]+(?: [0-9]+)*)
Note:
(?: [0-9]+)* makes the trailing digits optional. Also note that the unescaped . matches any character; use \. to match only a literal dot.

RegEx optional group with optional sub-group

I have a set of strings with fairly inconsistent naming, that should be structured enough to be divided into groups though.
Here's an excerpt:
test test 1970-2020 w15.txt
test 1970-2020 w15.csv
test 1990-99 q1 .txt
test 1981 w15 .csv
test test w15.csv
I am trying to extract information by groups (test-name, (year)?, suffix, type) using the following RegEx:
(.*)\s+([0-9]+(\-[0-9]+)?\s+)?((w|q)[0-9]+(\s+)?)(\..*)$
It works except for the optional group matching the years (an interval of years, a single year or no year at all).
What am I missing to make the pattern work?
Here's also a link to RegEx101 for testing:
https://regex101.com/r/wG3aM3/817
You could make the pattern a bit more specific and make the content of the year optional
^(.*?)\s+((?:\d{4}(?:-(?:\d{4}|\d{2}))?)?)\s+([wq][0-9]+)\s*(\.\w+)$
Explanation
^ Start of string
(.*?) Capture group 1, match any char except a newline 0+ times, non greedy
\s+ Match 1+ whitespace chars
( Capture group 2
(?: Non capture group
\d{4}(?:-(?:\d{4}|\d{2}))? Match 4 digits and optionally - and 2 or 4 digits
)? Close non capture group and make the year optional
) Close group 2
\s+ Match 1+ whitespace chars
([wq][0-9]+) Capture group 3 Match either w or q and 1+ digits 0-9
\s* Match 0+ whitespace chars
(\.\w+) Capture group 4, match a dot and 1+ word characters
$ End of string
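Note that \s could also match a newline. Also, when the year is absent, both \s+ still have to match, so at least 2 whitespace chars are needed before the suffix.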
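The sketch below is a variant, not the pattern from the answer: it moves the year and its trailing whitespace into one optional non capture group, so a row without a year needs only a single run of whitespace before the suffix. The names NAME_RE and split_name are hypothetical, and the year group is None (not an empty string) when no year is present.

```python
import re

# Variant (an assumption, not the answer's pattern): the year and the
# whitespace that follows it are optional together.
NAME_RE = re.compile(
    r"^(.*?)\s+(?:(\d{4}(?:-(?:\d{4}|\d{2}))?)\s+)?([wq][0-9]+)\s*(\.\w+)$"
)

# File names from the question's excerpt.
names = [
    "test test 1970-2020 w15.txt",
    "test 1970-2020 w15.csv",
    "test 1990-99 q1 .txt",
    "test 1981 w15 .csv",
    "test test w15.csv",
]


def split_name(name):
    """Return (test_name, year_or_None, suffix, file_type), or None."""
    m = NAME_RE.match(name)
    return m.groups() if m else None


for name in names:
    print(split_name(name))
```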

Task for matching floating point numbers

Task:
MATCH:
3.45
5,4
.45
3e4
,54
4
4.
4,
DON'T MATCH:
4,5e
2e
.3.
2e,4
,4.
d34
2.45t
2,45.
Currently I came up with the following:
(?<=\s|^)[-+]?(?:(?:[.,]?\d+[.,]?\d*[eE]\d+(?!\w|[.,]))|[.,]?\d+[.,]?\d*(?!\w|[.,]))\b
That works for almost everything except the last 2 numbers (4. and 4,), and I got stuck.
You may use
(?<!\S)[-+]?[0-9]*(?:[.,]?[0-9]+(?:[eE][-+]?[0-9]+)?|(?<=\d)[,.])(?!\S)
Details
(?<!\S) - start of string or a whitespace must appear immediately to the left
[-+]? - an optional + or -
[0-9]* - 0+ digits
(?:[.,]?[0-9]+(?:[eE][-+]?[0-9]+)?|(?<=\d)[,.]) - either
[.,]?[0-9]+(?:[eE][-+]?[0-9]+)? - an optional . or ,, then 1+ digits, then an optional sequence of e or E, followed with an optional + or - and 1+ digits
| - or
(?<=\d)[,.] - a dot or comma only if preceded with a digit (to avoid matching standalone . or ,)
(?!\S) - end of string or a whitespace must appear immediately to the right.
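The pattern can be checked against both lists from the task. This is a minimal Python sketch that joins all tokens with spaces, so the (?<!\S) and (?!\S) lookarounds act as token boundaries; the variable names are made up for illustration.

```python
import re

# Pattern copied from the answer above.
FLOAT_RE = re.compile(
    r"(?<!\S)[-+]?[0-9]*(?:[.,]?[0-9]+(?:[eE][-+]?[0-9]+)?|(?<=\d)[,.])(?!\S)"
)

should_match = ["3.45", "5,4", ".45", "3e4", ",54", "4", "4.", "4,"]
should_not_match = ["4,5e", "2e", ".3.", "2e,4", ",4.", "d34", "2.45t", "2,45."]

# One whitespace-separated string containing every token from the task.
text = " ".join(should_match + should_not_match)

# The pattern has no capture groups, so findall returns the whole matches.
found = FLOAT_RE.findall(text)
print(found)
```

The result should be exactly the MATCH list, in order.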
You could use an alternation to match 1+ digits followed by a dot or comma and 0+ digits or match the Ee part followed by 1+ digits.
Or match starting with a dot or comma followed by 1+ digits.
If this is the only thing to match on the line, you could use anchors ^ and $ or use lookarounds to assert that there are no non whitespace chars on the left and right.
(?<!\S)(?:\d+(?:[.,]\d*|[eE]\d+)?|[.,]\d+)(?!\S)
Pattern parts
(?<!\S) Assert what is directly to the left is not a non whitespace char
(?: Non capturing group
\d+ Match 1+ digits
(?: Non capturing group
[.,]\d* Match either . or , and 0+ digits
| Or
[eE]\d+ Match e or E and 1+ digits
)? Close group and make it optional
| Or
[.,]\d+ Match . or , and 1+ digits
) Close group
(?!\S) Assert what is directly to the right is not a non whitespace char
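A short Python sketch to check this pattern against the task lists. The two tokens in extras are made-up inputs, added to show what this simpler pattern leaves out: a leading sign, and an exponent after a decimal part.

```python
import re

# Pattern copied from the answer above.
SIMPLE_RE = re.compile(r"(?<!\S)(?:\d+(?:[.,]\d*|[eE]\d+)?|[.,]\d+)(?!\S)")

should_match = ["3.45", "5,4", ".45", "3e4", ",54", "4", "4.", "4,"]
should_not_match = ["4,5e", "2e", ".3.", "2e,4", ",4.", "d34", "2.45t", "2,45."]

found = SIMPLE_RE.findall(" ".join(should_match + should_not_match))
print(found)

# Hypothetical extra inputs, not part of the task: neither is accepted.
extras = ["-3.45", "3.45e5"]
print(SIMPLE_RE.findall(" ".join(extras)))
```

If signed numbers or values like 3.45e5 have to be accepted as well, the pattern would need an optional [-+]? and an exponent part after the decimals.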

Regular Expression not grouping

Need help with this regex
ABC 130 zlis 02-03/12 N180 Grouping req
A B Csd 130 pain 02/12 I80 alias
(\w+\s{0,3})(\d+)
The regex does not seem to group as I need it to.
Desired output, brackets are the groups I'm trying to detect.
(A B Csd) (130) (pain) (02/12) (I80) (alias)
Try this regex:
([a-z ]+?)\s+(\d+)\s+([a-z]+)\s+([\d-\/]+)\s+([\w ]+)
Explanation:
([a-z ]+?) - match 1+ occurrences (as few as possible) of a letter or a space and capture it as Group1
\s+ - matches 1+ occurrences of a whitespace character
(\d+) - match 1+ occurrences of digits and capture as Group2
\s+ - matches 1+ occurrences of a whitespace character
([a-z]+) - match 1+ occurrences of a letter and Capture as Group 3
\s+ - matches 1+ occurrences of a whitespace character
([\d-\/]+) - match 1+ occurrences of a digit or - or / and capture it as Group4
\s+ - matches 1+ occurrences of a whitespace character
([\w ]+) - match 1+ occurrences of a word-character or a space and capture as Group5
Note that I have used the g, i, m flags for Global matches, Case-insensitive and Multiline respectively.
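A minimal Python sketch of this pattern, with two adaptations that are assumptions rather than part of the answer. First, the character class is written as [\d/-] because Python's re module rejects \d-\/ as a bad character range. Second, SIX_RE is a variant that splits the last group in two, since the desired output in the question lists six groups. FIVE_RE, SIX_RE and groups are hypothetical names.

```python
import re

# The answer's pattern with the hyphen moved to the end of the class.
FIVE_RE = re.compile(
    r"([a-z ]+?)\s+(\d+)\s+([a-z]+)\s+([\d/-]+)\s+([\w ]+)", re.IGNORECASE
)

# Variant (an assumption): separate groups for the code and the trailing text.
SIX_RE = re.compile(
    r"([a-z ]+?)\s+(\d+)\s+([a-z]+)\s+([\d/-]+)\s+(\w+)\s+(.+)", re.IGNORECASE
)

lines = [
    "ABC 130 zlis 02-03/12 N180 Grouping req",
    "A B Csd 130 pain 02/12 I80 alias",
]


def groups(pattern, line):
    """Return the capture groups for one line, or None if it does not match."""
    m = pattern.match(line)
    return m.groups() if m else None


for line in lines:
    print(groups(FIVE_RE, line))
    print(groups(SIX_RE, line))
```

With FIVE_RE the last group holds "I80 alias" as one value; SIX_RE returns "I80" and "alias" separately.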