I am trying to separate any string into 2 groups, digits and chars, and eliminate all whitespace between these 2 groups. After the first digit, chars are allowed as well.
The pattern (\D*)(\S+) works well for me so far, except for the whitespace after the first group of chars.
Here is my regex demo.
You could exclude matching the whitespace chars as well using a negated character class [^\d\s]+ matching 1+ times any char except a whitespace char or a digit.
You can match optional whitespace chars using \s*
([^\d\s]+)\s*(\S+)
Explanation
( Capture group 1
[^\d\s]+ Match 1+ chars except a digit or whitespace char
) Close group
\s* Match 0+ whitespace chars
(\S+) Capture group 2, match 1+ times a non whitespace char
Regex demo
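As a quick check of the pattern above, here is a minimal Python sketch. The helper name split_chars_digits and the sample inputs are made up for illustration; they are not from the question.

```python
import re

# Pattern from the answer: group 1 is the leading run of chars that are
# neither digits nor whitespace, \s* swallows any whitespace in between,
# group 2 is the following run of non-whitespace chars.
PATTERN = re.compile(r"([^\d\s]+)\s*(\S+)")

def split_chars_digits(text):
    """Return (group 1, group 2), or None if the text does not start with a match."""
    m = PATTERN.match(text)
    return m.groups() if m else None

# Hypothetical inputs, chosen only to show the whitespace handling.
print(split_chars_digits("abc 123def"))  # ('abc', '123def')
print(split_chars_digits("abc123"))      # ('abc', '123')
```

Because the whitespace is matched outside both groups, neither captured value carries a leading or trailing space.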
Related
I've stumbled with certain types of rows.
I parse this information
195/75 R 16 C X Wonder Van 110/108R 10PR Tourador
The groups, which I need
I've got the following regex
([0-9]+)?\/([0-9]+)\s*\w\s*([0-9]+(?:\.\d+)?)\s*(C\s+)?(.+\s+?(?=[0-9]{2,3}|(\d{2,3}\/\d{2,3})))(?:(\d{2,3}\/\d{2,3})|(\d{2,3}))\s*(\w)(.*)
It works nicely for all kinds of rows, e.g.
225/55 R18 X Speed TU1 98V Toradfor
225/50 R 16 X Wonder TH1 96W XL Tourador
195/75 R 16 C X Wonder Van 110/108R 8PR Tourador
However, it doesn't work for
195/75 R 16 C X Wonder Van 110/108R 10PR Tourador
because of 10PR, where 10 consists of 2 digits
how it works now
Thank you!
In your pattern you use alternations | that can match and capture unrelated parts of the strings.
What you could do is use anchors and an optional capture group.
For all the given example strings you might use:
^(\d+)\/(\d+)\s+[A-Z]*\s*(\d+)\s*([A-Z])(.*?)(\d+\/\d+([A-Z]+))?\s+(\d+[A-Z]+\s+.*)$
The pattern in parts:
^ Start of string
(\d+)\/(\d+)\s+ Capture 1+ digits in group 1, match /, capture 1+ digits in group 2 and match 1+ whitespace chars
[A-Z]*\s* Match optional chars A-Z and optional whitespace chars
(\d+)\s* Capture 1+ digits in a group and match optional whitespace chars
([A-Z]) Capture a single char A-Z in a group
(.*?) Capture as few as possible chars in a group
( Capture group
\d+\/\d+ Match 1+ digits / and 1+ digits
([A-Z]+) Capture 1+ chars A-Z
)? Close the capture group and make it optional
\s+ Match 1+ whitespace chars
(\d+[A-Z]+\s+.*) Capture group, match 1+ digits, 1+ chars A-Z, 1+ whitespace chars and the rest of the line
$ End of string
Regex demo
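To see which group ends up where, the pattern can be run over the rows from the question. This is only a sketch in Python; the helper name parse_row is an assumption, and the groups are printed by position because the question's list of wanted groups is not shown.

```python
import re

# The anchored pattern from the answer, unchanged.
TIRE = re.compile(
    r"^(\d+)\/(\d+)\s+[A-Z]*\s*(\d+)\s*([A-Z])(.*?)(\d+\/\d+([A-Z]+))?\s+(\d+[A-Z]+\s+.*)$"
)

def parse_row(row):
    """Return the tuple of capture groups, or None if the row does not match."""
    m = TIRE.match(row)
    return m.groups() if m else None

rows = [
    "225/55 R18 X Speed TU1 98V Toradfor",
    "225/50 R 16 X Wonder TH1 96W XL Tourador",
    "195/75 R 16 C X Wonder Van 110/108R 8PR Tourador",
    "195/75 R 16 C X Wonder Van 110/108R 10PR Tourador",
]

for row in rows:
    print(parse_row(row))
```

For the 10PR row, groups 6 and 8 come out as 110/108R and 10PR Tourador. Group 5 is lazy and keeps its surrounding spaces, so you may want to call .strip() on it.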
I have a string that has the following structure:
digit-word(s)-digit.
For example:
2029 AG.IZTAPALAPA 2
I want to extract the word(s) in the middle, and the digit at the end of the string.
I want to extract AG.IZTAPALAPA and 2 in the same capture group to extract like:
AG.IZTAPALAPA 2
I managed to capture them as individual capture groups but not as a single:
town_state['municipality'] = town_state['Town'].str.extract(r'(\D+)', expand=False)
town_state['number'] = town_state['Town'].str.extract(r'(\d+)$', expand=False)
Thank you for your help!
You can use a single capturing group for the example string. It matches a single "word" that consists of uppercase chars A-Z with optional dots in the middle (a dot cannot be at the start or end), followed by a space and 1 or more digits.
\b\d+ ([A-Z]+(?:\.[A-Z]+)* \d+)\b
Explanation
\b A word boundary
\d+ Match 1+ digits, followed by a space
( Capture group 1
[A-Z]+ Match 1+ occurrences of an uppercase char A-Z
(?:\.[A-Z]+)* \d+ Repeat 0+ times matching a dot and 1+ chars A-Z, then match a space and 1+ digits
) Close group 1
\b A word boundary
Regex demo
Or you can make the pattern a bit broader matching either a dot or a word character
\b\d+ ([\w.]+(?: [\w.]+)* \d+)\b
Regex demo
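A small sketch comparing the two patterns above with Python's re module. The helper name extract_tail and the multi-word sample "2029 SAN JUAN 3" are made up for illustration.

```python
import re

# Strict pattern: one uppercase "word" with optional inner dots, then the digits.
STRICT = re.compile(r"\b\d+ ([A-Z]+(?:\.[A-Z]+)* \d+)\b")
# Broader pattern: one or more space-separated runs of word chars or dots.
BROAD = re.compile(r"\b\d+ ([\w.]+(?: [\w.]+)* \d+)\b")

def extract_tail(text, pattern=STRICT):
    """Return capture group 1 (word(s) plus trailing number), or None."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(extract_tail("2029 AG.IZTAPALAPA 2"))        # AG.IZTAPALAPA 2
print(extract_tail("2029 SAN JUAN 3"))             # None, the strict pattern allows one word only
print(extract_tail("2029 SAN JUAN 3", BROAD))      # SAN JUAN 3
```

Either raw string should also work as the argument to pandas' Series.str.extract(..., expand=False) used in the question, since it contains exactly one capture group.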
You can use the following simple regex:
[0-9]+\s([A-Z]+.[A-Z]+(?: [0-9]+)*)
Note:
(?: [0-9]+)* makes the trailing digits optional. Note that the unescaped . matches any char; use \. to match only a literal dot.
I have a set of strings with fairly inconsistent naming, that should be structured enough to be divided into groups though.
Here's an excerpt:
test test 1970-2020 w15.txt
test 1970-2020 w15.csv
test 1990-99 q1 .txt
test 1981 w15 .csv
test test w15.csv
I am trying to extract information by groups (test-name, (year)?, suffix, type) using the following RegEx:
(.*)\s+([0-9]+(\-[0-9]+)?\s+)?((w|q)[0-9]+(\s+)?)(\..*)$
It works except for the optional group matching the years (a range of years, a single year, or no year at all).
What am I missing to make the pattern work?
Here's also a link to RegEx101 for testing:
https://regex101.com/r/wG3aM3/817
You could make the pattern a bit more specific and make the content of the year optional
^(.*?)\s+((?:\d{4}(?:-(?:\d{4}|\d{2}))?)?)\s+([wq][0-9]+)\s*(\.\w+)$
Explanation
^ Start of string
(.*?) Capture group 1 Match 0+ times any char except a newline non greedy
\s+ Match 1+ whitespace chars
( Capture group 2
(?: Non capture group
\d{4}(?:-(?:\d{4}|\d{2}))? Match 4 digits and optionally - and 2 or 4 digits
)? Close non capture group and make the year optional
) Close group 2
\s+ Match 1+ whitespace chars
([wq][0-9]+) Capture group 3 Match either w or q and 1+ digits 0-9
\s* Match 0+ whitespace chars
(\.\w+) Capture group 4, match a dot and 1+ word characters
$ End of string
Regex demo
Note that \s could also match a newline.
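One thing to watch for: the pattern has a mandatory \s+ on both sides of the optional year, so a row without a year needs two whitespace chars between the name and the suffix. With the rows exactly as written in the question (single spaces), the last row would not match. The sketch below checks that in Python and shows one possible variant, which is my own assumption and not part of the answer, that moves the second \s+ inside an optional group around the year.

```python
import re

# Pattern from the answer, unchanged.
ORIGINAL = re.compile(
    r"^(.*?)\s+((?:\d{4}(?:-(?:\d{4}|\d{2}))?)?)\s+([wq][0-9]+)\s*(\.\w+)$"
)
# Hypothetical variant: the year and its trailing whitespace are optional together.
VARIANT = re.compile(
    r"^(.*?)\s+(?:(\d{4}(?:-(?:\d{4}|\d{2}))?)\s+)?([wq][0-9]+)\s*(\.\w+)$"
)

rows = [
    "test test 1970-2020 w15.txt",
    "test 1970-2020 w15.csv",
    "test 1990-99 q1 .txt",
    "test 1981 w15 .csv",
    "test test w15.csv",
]

for row in rows:
    original = ORIGINAL.match(row)
    variant = VARIANT.match(row)
    print(row, "|", original.groups() if original else None,
          "|", variant.groups() if variant else None)
```

In the variant the year group is None when no year is present, where the original would give an empty string.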
Task:
MATCH:
3.45
5,4
.45
3e4
,54
4
4.
4,
DON'T MATCH:
4,5e
2e
.3.
2e,4
,4.
d34
2.45t
2,45.
So far I have come up with the following:
(?<=\s|^)[-+]?(?:(?:[.,]?\d+[.,]?\d*[eE]\d+(?!\w|[.,]))|[.,]?\d+[.,]?\d*(?!\w|[.,]))\b
That works for almost everything except the last 2 numbers (4. and 4,), and I got stuck.
You may use
(?<!\S)[-+]?[0-9]*(?:[.,]?[0-9]+(?:[eE][-+]?[0-9]+)?|(?<=\d)[,.])(?!\S)
See the regex demo
Details
(?<!\S) - start of string or a whitespace must appear immediately to the left
[-+]? - an optional + or -
[0-9]* - 0+ digits
(?:[.,]?[0-9]+(?:[eE][-+]?[0-9]+)?|(?<=\d)[,.]) - either
[.,]?[0-9]+(?:[eE][-+]?[0-9]+)? - an optional . or ,, then 1+ digits, then an optional sequence of e or E, followed with an optional + or - and 1+ digits
| - or
(?<=\d)[,.] - a dot or comma only if preceded with a digit (to avoid matching standalone . or ,)
(?!\S) - end of string or a whitespace must appear immediately to the right.
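The two lists from the question make a handy test table. A minimal sketch in Python, where the helper name is_number is an assumption:

```python
import re

# Pattern from the answer, unchanged.
NUMBER = re.compile(
    r"(?<!\S)[-+]?[0-9]*(?:[.,]?[0-9]+(?:[eE][-+]?[0-9]+)?|(?<=\d)[,.])(?!\S)"
)

SHOULD_MATCH = ["3.45", "5,4", ".45", "3e4", ",54", "4", "4.", "4,"]
SHOULD_NOT_MATCH = ["4,5e", "2e", ".3.", "2e,4", ",4.", "d34", "2.45t", "2,45."]

def is_number(token):
    """True if the pattern finds a match anywhere in the token."""
    return NUMBER.search(token) is not None

for token in SHOULD_MATCH + SHOULD_NOT_MATCH:
    print(repr(token), is_number(token))
```

The whitespace lookarounds are what reject tokens such as 2.45t: a match may only start and end next to whitespace or the string edges.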
You could use an alternation: either match 1+ digits, optionally followed by a dot or comma and 0+ digits, or by e or E and 1+ digits.
Or match a dot or comma followed by 1+ digits.
If this is the only thing to match on the line, you could use anchors ^ and $ or use lookarounds to assert that there are no non whitespace chars on the left and right.
(?<!\S)(?:\d+(?:[.,]\d*|[eE]\d+)?|[.,]\d+)(?!\S)
Pattern parts
(?<!\S) Assert what is directly to the left is not a non whitespace char
(?: Non capturing group
\d+ Match 1+ digits
(?: Non capturing group
[.,]\d* Match either . or , and 0+ digits
| Or
[eE]\d+ Match e or E and 1+ digits
)? Close group and make it optional
| Or
[.,]\d+ Match . or , and 1+ digits
) Close group
(?!\S) Assert what is directly to the right is not a non whitespace char
Regex demo
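Since this pattern uses lookarounds instead of anchors, it can also pull the valid numbers out of a longer whitespace-separated string. A sketch in Python; the helper name find_numbers is an assumption, and the sample text is simply all tokens from the question joined with spaces.

```python
import re

# Pattern from the answer, unchanged. It has no capture groups,
# so findall returns the full matches.
NUMBER = re.compile(r"(?<!\S)(?:\d+(?:[.,]\d*|[eE]\d+)?|[.,]\d+)(?!\S)")

def find_numbers(text):
    """Return every whitespace-delimited token that is a valid number."""
    return NUMBER.findall(text)

text = "3.45 5,4 .45 3e4 ,54 4 4. 4, 4,5e 2e .3. 2e,4 ,4. d34 2.45t 2,45."
print(find_numbers(text))
# ['3.45', '5,4', '.45', '3e4', ',54', '4', '4.', '4,']
```

Unlike the previous pattern, this one does not allow a leading sign, so prepend [-+]? after the lookbehind if you need signed numbers.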
So I currently have a regex (https://regex101.com/r/zBE4Ju/1) that highlights the words before and after a linebreak. This is nice, but the issue is sometimes there are whitespaces after the word that appears BEFORE the line break. So they end up
You can see on my regex101 how the issue happens, and I have outlined the problem. I need to recognize the word before and after the line break, regardless of if there is a space after the word.
(\w*(?:[\n](?![\n])\w*)+)
You can see it in action here https://regex101.com/r/zBE4Ju/3
Expected: Line 1
Actual: Line 3
You can use $1 from:
/([^ ]+) *(\r|\n)/gm
https://regex101.com/r/o87VP7/5
If you want to highlight the last "word" in the sentence followed by possible spaces and a newline, you could repeat 0+ times a group matching 1+ non whitespace chars followed by 1+ spaces.
Then capture in a group matching non whitespace chars (\S+) and match possible spaces followed by a newline.
^ *(?:\S+ +)*(\S+) *\r?\n
Explanation
^ Start of string
* Match 0+ times a space
(?: Non capturing group
\S+ + Match 1+ non whitespace chars and 1+ spaces
)* Close non capturing group and repeat 0+ times (to also match a single word at the beginning)
(\S+) Capture group 1, match 1+ times a non whitespace char
*\r?\n Match 0+ times a space followed by a newline
Regex demo
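A minimal Python sketch of the last pattern. It assumes the multiline flag, so that ^ matches at the start of every line, and the sample text is made up.

```python
import re

# Pattern from the answer; re.MULTILINE is an assumption so that ^ applies per line.
LAST_WORD = re.compile(r"^ *(?:\S+ +)*(\S+) *\r?\n", re.MULTILINE)

def words_before_linebreaks(text):
    """Return the last word of every line that ends with a newline."""
    return LAST_WORD.findall(text)

# Hypothetical text: the first line has trailing spaces before the line break.
text = "first line here   \nsecond line\nlast"
print(words_before_linebreaks(text))  # ['here', 'line']
```

The trailing spaces are matched outside the capture group, so group 1 holds only the word itself. The final line has no newline after it and is therefore not matched.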