RegEx optional group with optional sub-group - regex

I have a set of strings with fairly inconsistent naming that should nonetheless be structured enough to be divided into groups.
Here's an excerpt:
test test 1970-2020 w15.txt
test 1970-2020 w15.csv
test 1990-99 q1 .txt
test 1981 w15 .csv
test test w15.csv
I am trying to extract information by groups (test-name, (year)?, suffix, type) using the following RegEx:
(.*)\s+([0-9]+(\-[0-9]+)?\s+)?((w|q)[0-9]+(\s+)?)(\..*)$
It works except for the optional group matching the years (an interval of years, a single year, or no year at all).
What am I missing to make the pattern work?
Here's also a link to RegEx101 for testing:
https://regex101.com/r/wG3aM3/817

You could make the pattern a bit more specific and make the content of the year group optional:
^(.*?)\s+((?:\d{4}(?:-(?:\d{4}|\d{2}))?)?)\s+([wq][0-9]+)\s*(\.\w+)$
Explanation
^ Start of string
(.*?) Capture group 1 Match 0+ times any char except a newline non greedy
\s+ Match 1+ whitespace chars
( Capture group 2
(?: Non capture group
\d{4}(?:-(?:\d{4}|\d{2}))? Match 4 digits and optionally - and 2 or 4 digits
)? Close non capture group and make the year optional
) Close group 2
\s+ Match 1+ whitespace chars
([wq][0-9]+) Capture group 3 Match either w or q and 1+ digits 0-9
\s* Match 0+ whitespace chars
(\.\w+) Capture group 4, match a dot and 1+ word characters
$ End of string
Regex demo
Note that \s could also match a newline.
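One thing to check when trying this on the excerpt: the pattern above requires whitespace on both sides of the (possibly empty) year group, so a name without a year needs at least two whitespace characters before the suffix. Below is a minimal Python sketch of a variant, not the exact pattern above, that makes the year and its trailing whitespace optional together, so a single space is enough. The helper name parse_name is hypothetical.

```python
import re

# Variant of the answer's pattern: the year and the whitespace after it
# are optional as a unit, so "name w15.csv" with one space still matches.
PATTERN = re.compile(
    r"^(.*?)\s+(?:(\d{4}(?:-(?:\d{4}|\d{2}))?)\s+)?([wq][0-9]+)\s*(\.\w+)$"
)

def parse_name(name):
    """Return (test_name, years, suffix, extension), or None if no match."""
    m = PATTERN.match(name)
    return m.groups() if m else None

names = [
    "test test 1970-2020 w15.txt",
    "test 1970-2020 w15.csv",
    "test 1990-99 q1 .txt",
    "test 1981 w15 .csv",
    "test test w15.csv",
]
for name in names:
    print(name, "->", parse_name(name))
```

In this variant the year group is None when the name has no year, rather than an empty string.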

Related

Regex get certain information

I've stumbled with certain types of rows.
I parse this information
195/75 R 16 C X Wonder Van 110/108R 10PR Tourador
The groups, which I need
I've got the following regex
([0-9]+)?\/([0-9]+)\s*\w\s*([0-9]+(?:\.\d+)?)\s*(C\s+)?(.+\s+?(?=[0-9]{2,3}|(\d{2,3}\/\d{2,3})))(?:(\d{2,3}\/\d{2,3})|(\d{2,3}))\s*(\w)(.*)
It works nicely for all kinds of rows, e.g.
225/55 R18 X Speed TU1 98V Toradfor
225/50 R 16 X Wonder TH1 96W XL Tourador
195/75 R 16 C X Wonder Van 110/108R 8PR Tourador
However, it doesn't work for
195/75 R 16 C X Wonder Van 110/108R 10PR Tourador
because of 10PR, where 10 consists of 2 digits
how it works now
Thank you!
In your pattern you use alternations | that can match and capture unrelated parts of the strings.
What you could do is use anchors and an optional capture group.
For all the given example strings you might use:
^(\d+)\/(\d+)\s+[A-Z]*\s*(\d+)\s*([A-Z])(.*?)(\d+\/\d+([A-Z]+))?\s+(\d+[A-Z]+\s+.*)$
The pattern in parts:
^ Start of string
(\d+)\/(\d+)\s+ Capture 2 times 1+ digits in a group
[A-Z]*\s* Match optional chars A-Z and optional whitespace chars
(\d+)\s* Capture 1+ digits in a group and match optional whitespace chars
([A-Z]) Capture a single char A-Z in a group
(.*?) Capture as few as possible chars in a group
( Capture group
\d+\/\d+ Match 1+ digits / and 1+ digits
([A-Z]+) Capture 1+ chars A-Z
)? Close the capture group and make it optional
\s+ Match 1+ whitespace chars
(\d+[A-Z]+\s+.*) Capture group, match 1+ digits, 1+ chars A-Z, 1+ whitespace chars and the rest of the line
$ End of string
Regex demo
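As a rough check of the pattern above, here is a small Python sketch that applies it to two of the rows from the question. The helper name split_row is hypothetical, and stripping the groups is an addition of this sketch, since the lazy group 5 keeps its surrounding spaces.

```python
import re

TYRE = re.compile(
    r"^(\d+)\/(\d+)\s+[A-Z]*\s*(\d+)\s*([A-Z])(.*?)"
    r"(\d+\/\d+([A-Z]+))?\s+(\d+[A-Z]+\s+.*)$"
)

def split_row(row):
    """Return the 8 capture groups (stripped), or None if the row does not match."""
    m = TYRE.match(row)
    if m is None:
        return None
    # Optional groups that did not participate in the match stay None.
    return [g.strip() if g is not None else None for g in m.groups()]

print(split_row("195/75 R 16 C X Wonder Van 110/108R 10PR Tourador"))
print(split_row("225/55 R18 X Speed TU1 98V Toradfor"))
```

For the row with 10PR, the optional group captures 110/108R and the last group captures 10PR Tourador; for rows without a load-index pair, groups 6 and 7 are None.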

Regex match specific strings

I want to capture all of the following strings from multi-line data. The expected result is listed below; my current pattern does not work.
Pattern: ^XYZ/[0-9|ALL|P] I'm lost with this part. Can anyone help?
Result
XYZ/1
XYZ/1,2-5
XYZ/5,7,8-9
XYZ/2-4,6-8,9
XYZ/ALL
XYZ/P1
XYZ/P2,3
XYZ/P4,5-7
XYZ/P1-4,5-7,8-9
Changed to
XYZ/1
XYZ/1,2-5
XYZ/5,7,8-9
XYZ/2-4,6-8,9
XYZ/A12345 after the slash limited to 6 alphanumeric chars
XYZ/LH-1234567890 after the /LH- limited to 10 numeric chars
The pattern could be:
^XYZ\/(?:ALL|P?[0-9]+(?:-[0-9]+)?(?:,[0-9]+(?:-[0-9]+)?)*)$
The pattern in parts matches:
^ Start of string
XYZ\/ Match XYZ/ (You don't have to escape the / depending on the pattern delimiters)
(?: Outer non capture group for the alternatives
ALL Match literally
| Or
P? Match an optional P
[0-9]+(?:-[0-9]+)? Match 1+ digits with an optional - and 1+ digits
(?: Non capture group to match as a whole
,[0-9]+(?:-[0-9]+)? Match , and 1+ digits and an optional - and 1+ digits
)* Close the non capture group and optionally repeat it
) Close the outer non capture group
$ End of string
Regex demo
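A minimal Python sketch of how this pattern might be applied to multi-line data, assuming the multiline flag so that ^ and $ anchor at each line. The last three input lines are made-up negative examples, not from the question.

```python
import re

LINE = re.compile(
    r"^XYZ\/(?:ALL|P?[0-9]+(?:-[0-9]+)?(?:,[0-9]+(?:-[0-9]+)?)*)$",
    re.MULTILINE,  # let ^ and $ match at the start and end of every line
)

data = """XYZ/1
XYZ/1,2-5
XYZ/ALL
XYZ/P4,5-7
XYZ/ALL1
XYZ/P
ABC/1"""

# The pattern has no capture groups, so findall returns the whole matches.
matches = LINE.findall(data)
print(matches)
```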
You can use this regex pattern to match those lines
^XYZ\/(?:P|ALL|[0-9])[0-9,-]*$
Use the global g and multiline m flags.
Btw, [P|ALL] doesn't match the word "ALL".
It only matches a single character that's a P or A or L or |.
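The difference between a character class and an alternation can be checked directly. The sketch below is only an illustration in Python; the helper name full and the test strings XYZ/1,,- and XYZ/P are made up. It also shows that the shorter pattern is more permissive than a pattern that spells out the number and range structure.

```python
import re

char_class = re.compile(r"[P|ALL]")     # one character out of: P, |, A, L
alternation = re.compile(r"(?:P|ALL)")  # either "P" or the word "ALL"
loose = re.compile(r"^XYZ\/(?:P|ALL|[0-9])[0-9,-]*$")

def full(pattern, text):
    """True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None

print(full(char_class, "ALL"))   # False: the class matches a single character
print(full(char_class, "|"))     # True: | is a literal inside [...]
print(full(alternation, "ALL"))  # True
print(full(loose, "XYZ/1,,-"))   # True: [0-9,-]* allows any mix of these chars
print(full(loose, "XYZ/P"))      # True: nothing has to follow the P
```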

Regex start with any number and it should end without zeros

I am trying to create a Regex with groups that should group 1234.0500- to 1234.05-.
What I have tried is:
^([0-9]+)(\.)([1-9]*)0*(-?)$
but it does not match 1234.0500-. Here is the example https://regex101.com/r/koSZoB/1. The regex should also group
1234.0000
0.9000
to
1234
0.9
In your pattern, this part ([1-9]*)0*(-?)$ matches optional digits 1-9 followed by optional zeroes and then an optional hyphen at the end of the string. It will succeed until the first zero:
0500
^
But the match will fail as it can not match (-?)$
You could use 3 capturing groups and use those in the replacement.
After group 1, you could either match a dot followed by only zeroes, which should be removed, or capture in group 2 from the dot until the last digit 1-9 and match the trailing zeroes so they are removed.
^(\d+)(?:\.0+|(\.\d*[1-9])0+)(-?)$
Explanation
^ Start of string
(\d+) Capture group 1, match 1+ digits
(?: Non capture group, match either
\.0+ Match a . and 1+ zeroes
| Or
(\.\d*[1-9])0+ Capture ., 0+ digits followed by a digit 1-9 and match the following 1+ zeroes to be removed
) Close group
(-?) Capture optional -
$ End of string
Regex demo
There is no language tagged, but for example in JavaScript:
const pattern = /^(\d+)(?:\.0+|(\.\d*[1-9])0+)(-?)$/;
[
"1234.0500-",
"1234.05500-",
"1234.0550588500-",
"1234.0000",
"0.9000",
"12.1222",
"12.1222-",
].forEach(s => console.log(s.replace(pattern, "$1$2$3")));
The third capture group ([1-9]*) does not accept zeroes, so the 0 in 05 makes the match fail.
I would suggest allowing all digits in the third capture group and making it non-greedy by adding a ? after the quantifier: ^([0-9]+)(\.)([0-9]*?)0*(-?)$ The lazy group then matches as few digits as possible, and the greedy 0* that follows consumes the trailing zeroes.
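To compare the two suggestions side by side, here is a small Python sketch. The function names trim_strict and trim_lazy are hypothetical, and the replacement strings are an assumption about how the groups would be reassembled. One difference it shows: the lazy pattern captures the dot in its own group, so an input such as 1234.0000 keeps a trailing dot.

```python
import re

# First suggestion: drop ".000..." entirely, or keep the digits up to the last 1-9.
STRICT = re.compile(r"^(\d+)(?:\.0+|(\.\d*[1-9])0+)(-?)$")
# Second suggestion: lazy digit group, trailing zeroes consumed by 0*.
LAZY = re.compile(r"^([0-9]+)(\.)([0-9]*?)0*(-?)$")

def trim_strict(s):
    # An unmatched group 2 is replaced by an empty string (Python 3.5+).
    return STRICT.sub(r"\1\2\3", s)

def trim_lazy(s):
    return LAZY.sub(r"\1\2\3\4", s)

for s in ["1234.0500-", "1234.0000", "0.9000", "12.1222"]:
    print(s, "->", trim_strict(s), "|", trim_lazy(s))
```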

How can I extract non digit characters and digit characters in the end of a string?

I have a string that has the following structure:
digit-word(s)-digit.
For example:
2029 AG.IZTAPALAPA 2
I want to extract the word(s) in the middle, and the digit at the end of the string.
I want to extract AG.IZTAPALAPA and 2 in the same capture group to extract like:
AG.IZTAPALAPA 2
I managed to capture them as individual capture groups but not as a single:
town_state['municipality'] = town_state['Town'].str.extract(r'(\D+)', expand=False)
town_state['number'] = town_state['Town'].str.extract(r'(\d+)$', expand=False)
Thank you for your help!
You can use a single capturing group for the example string to match a single "word" that consists of uppercase chars A-Z, with an optional dot in the middle that cannot be at the start or end, followed by 1 or more digits.
\b\d+ ([A-Z]+(?:\.[A-Z]+)* \d+)\b
Explanation
\b A word boundary
\d+ Match 1+ digits, followed by a space
( Capture group 1
[A-Z]+ Match 1+ occurrences of an uppercase char A-Z
(?:\.[A-Z]+)* \d+ Repeat 0+ times matching a dot and 1+ chars A-Z, then match a space and 1+ digits
) Close group 1
\b A word boundary
Regex demo
Or you can make the pattern a bit broader matching either a dot or a word character
\b\d+ ([\w.]+(?: [\w.]+)* \d+)\b
Regex demo
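A short Python sketch comparing the two patterns with the plain re module; the same pattern strings should also work with Series.str.extract as used in the question. The helper name extract and the multi-word input 2029 SAN JUAN 3 are made up for illustration.

```python
import re

NARROW = re.compile(r"\b\d+ ([A-Z]+(?:\.[A-Z]+)* \d+)\b")
BROAD = re.compile(r"\b\d+ ([\w.]+(?: [\w.]+)* \d+)\b")

def extract(pattern, text):
    """Return capture group 1 of the first match, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(extract(NARROW, "2029 AG.IZTAPALAPA 2"))
print(extract(BROAD, "2029 AG.IZTAPALAPA 2"))
# The narrow pattern allows only one dotted "word"; the broad one allows several.
print(extract(NARROW, "2029 SAN JUAN 3"))
print(extract(BROAD, "2029 SAN JUAN 3"))
```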
You can use the following simple regex:
[0-9]+\s([A-Z]+.[A-Z]+(?: [0-9]+)*)
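Note:
(?: [0-9]+)* makes the trailing digits optional.
The unescaped . between the two [A-Z]+ parts matches any character, not only a literal dot; use \. to match only a dot.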

regex combination of two lookaround - regexstorm.net

I have to collect two pieces of information from a text using regex: the name and the database, and relate them in one table. But I can only collect them individually.
This is an example. I have many blocks like this, and two of them don't have a database value; these I need to ignore.
[SCD] {I need the name between []}
Driver=/opt/pcenter/pc961/ODBC7.1/lib/DWmsss27.so
Description=
Database=scd {I need the value after Defaut|Database}
Address=#######
LogonID=######
Password=######
QuoteId=No
AnsiNPW=No
ApplicationsUsingThreads=1
The regex to find the name is:
(?<=\[)(.*)(?=\])
The regex to find the value after database is
(?<=Defaut|Database=)(.*)
How can I combine both of them into one regex?
To match both values, you could use 2 capturing groups instead, with a repeating pattern and a negative lookahead that checks that a line does not start with Default or Database, until a line that does.
\[([^]]+)\](?:\r?\n(?!Default|Database).*)*\r?\n(?:Default|Database)=(\S+)
About the pattern
\[ Match [
( Capture group 1
[^]]+ match 1+ times not ]
) Close group 1
\] Match ]
(?: Non capturing group
\r?\n Match newline,
(?! Negative lookahead, assert what is directly on the right is not
Default|Database Match one of the options
).* Close negative lookahead and match any char except a newline 0+ times
)* Close non capturing group and repeat 0+ times
\r?\n(?:Default|Database)= Match newline, any of the options and =
(\S+) Capturing group 2, match 1+ times a non whitespace char (or use (.+) to match any char 1+ times)
regexstorm demo
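The pattern can also be tried outside regexstorm. Below is a minimal Python sketch; the [NOVALUE] and [ORA] blocks and their driver paths are made up, and it assumes that a block without a database value still contains an empty Database= line.

```python
import re

BLOCK = re.compile(
    r"\[([^]]+)\](?:\r?\n(?!Default|Database).*)*\r?\n(?:Default|Database)=(\S+)"
)

text = """[SCD]
Driver=/opt/pcenter/pc961/ODBC7.1/lib/DWmsss27.so
Description=
Database=scd
Address=#######

[NOVALUE]
Driver=/hypothetical/driver.so
Database=
Address=#######

[ORA]
Driver=/hypothetical/other.so
Default=ora1
"""

# findall returns one (name, database) tuple per matching block.
pairs = BLOCK.findall(text)
print(pairs)
```

The block with the empty Database= line is skipped because (\S+) needs at least one character. If a block omitted the line entirely, the repeated group could run on into the next block; adding \[ to the negative lookahead would be one way to guard against that.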