I am trying to split this string into key-value pairs using a regex:
val x = "title=MyTitle, active=true, title2=MyTitle, Subtitle, new=false, title3=My Title#subtitle1"
I tried this pattern:
([\w]+)=(.*?)([\w]+)
The output is:
title=MyTitle
active=true
title2=MyTitle
new=false
title3=My
How can I modify the regex so that the output becomes:
title=MyTitle
active=true
title2=MyTitle, Subtitle
new=false
title3=My Title#subtitle1
A look-ahead works pretty well
(\w+)=(.*?(?=,\s*\w+=|\s*$))
Breakdown
( # group 1 (key)
\w+ # word characters
) # end group 1
= # equals sign
( # group 2 (value)
.*? # anything, non-greedy
(?= # look-ahead ("followed by")...
,\s* # comma and spaces
\w+= # word characters and an equals sign (the next key)
| # or
\s*$ # spaces and the end of string
) # end look-ahead
) # end group 2
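To try the look-ahead pattern on the string from the question, here is a minimal Python sketch. The question's snippet looks like Scala, so the choice of Python and the names PAIR and parse_pairs are assumptions made for illustration, not part of the answer.

```python
import re

# The look-ahead pattern from the answer above, unchanged.
PAIR = re.compile(r"(\w+)=(.*?(?=,\s*\w+=|\s*$))")

def parse_pairs(text):
    """Hypothetical helper: return the key/value pairs found in text as a dict."""
    return dict(PAIR.findall(text))

x = "title=MyTitle, active=true, title2=MyTitle, Subtitle, new=false, title3=My Title#subtitle1"

for key, value in parse_pairs(x).items():
    print(f"{key}={value}")
```

The value group stops only where a comma is followed by the next key, which is why the comma inside title2 survives.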
You might use 2 capturing groups.
In the first group, capture 1+ word chars (the key). In the second group, capture the value, adding further space-separated words only while what is on the right is not a whitespace char followed by a word and an equals sign (the next key).
(\w+)=([\w+,]+(?:(?!\s\w+=)\s[#\w]+)*)(?:,|$)
(\w+) Capture group 1, match 1+ word chars
= Match =
( Capture group 2
[\w+,]+ Match 1+ times a word char, + or ,
(?: Non capturing group
(?!\s\w+=) Assert what is on the right is not a whitespace, 1+ word chars and =
\s[#\w]+ Match a whitespace and 1+ word chars or #
)* Close group and repeat 0+ times
) Close group 2
(?:,|$) Match either , or assert the end of the string
Or you might use a broader match for the value part using a negated character class instead of matching the specified characters.
(\w+)=([^=\s]+(?:(?!\s\w+=)\s[^=\s,]+)*)(?:,|$)
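As a quick check, the two patterns from this answer can be run side by side on the question's string. This is only a sketch in Python; the names SPECIFIC and BROADER are made up for the example.

```python
import re

# Both patterns copied from the answer above.
SPECIFIC = re.compile(r"(\w+)=([\w+,]+(?:(?!\s\w+=)\s[#\w]+)*)(?:,|$)")
BROADER = re.compile(r"(\w+)=([^=\s]+(?:(?!\s\w+=)\s[^=\s,]+)*)(?:,|$)")

x = "title=MyTitle, active=true, title2=MyTitle, Subtitle, new=false, title3=My Title#subtitle1"

for name, pattern in (("specific", SPECIFIC), ("broader", BROADER)):
    # findall returns one (key, value) tuple per match.
    print(name, pattern.findall(x))
```

On this input both patterns should yield the same five pairs; they differ only in which characters they accept inside a value.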
I have a string that has the following structure:
digit-word(s)-digit.
For example:
2029 AG.IZTAPALAPA 2
I want to extract the word(s) in the middle, and the digit at the end of the string.
I want AG.IZTAPALAPA and 2 in the same capture group, like this:
AG.IZTAPALAPA 2
I managed to capture them as individual capture groups, but not as a single one:
town_state['municipality'] = town_state['Town'].str.extract(r'(\D+)', expand=False)
town_state['number'] = town_state['Town'].str.extract(r'(\d+)$', expand=False)
Thank you for your help!
You can use a single capturing group. For the example string, it matches a single "word" made of uppercase chars A-Z, optionally with dots in the middle (not at the start or end), followed by a space and 1 or more digits.
\b\d+ ([A-Z]+(?:\.[A-Z]+)* \d+)\b
Explanation
\b A word boundary
\d+ Match 1+ digits, then a space
( Capture group 1
[A-Z]+ Match 1+ occurrences of an uppercase char A-Z
(?:\.[A-Z]+)* \d+ Repeat 0+ times matching a dot and 1+ chars A-Z, then match a space and 1+ digits
) Close group 1
\b A word boundary
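A small sketch of the pattern with Python's re module, using the example string from the question. TOWN and extract_town are hypothetical names; the same pattern string could presumably be passed to the str.extract call shown in the question.

```python
import re

# Pattern from the answer above: group 1 holds the word and the trailing number.
TOWN = re.compile(r"\b\d+ ([A-Z]+(?:\.[A-Z]+)* \d+)\b")

def extract_town(text):
    """Hypothetical helper: return group 1, or None when nothing matches."""
    match = TOWN.search(text)
    return match.group(1) if match else None

print(extract_town("2029 AG.IZTAPALAPA 2"))
```

Because the character class is [A-Z], a lowercase variant of the same string would not match.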
Or you can make the pattern a bit broader matching either a dot or a word character
\b\d+ ([\w.]+(?: [\w.]+)* \d+)\b
You can use the following simple regex:
[0-9]+\s([A-Z]+.[A-Z]+(?: [0-9]+)*)
Note:
(?: [0-9]+)* makes the trailing number optional.
The unescaped . in the pattern matches any character, not only a literal dot; use \. to require a dot.
I have a set of strings with fairly inconsistent naming that should still be structured enough to be divided into groups.
Here's an excerpt:
test test 1970-2020 w15.txt
test 1970-2020 w15.csv
test 1990-99 q1 .txt
test 1981 w15 .csv
test test w15.csv
I am trying to extract information by groups (test-name, (year)?, suffix, type) using the following RegEx:
(.*)\s+([0-9]+(\-[0-9]+)?\s+)?((w|q)[0-9]+(\s+)?)(\..*)$
It works except for the optional group matching the years (a range of years, a single year, or no year at all).
What am I missing to make the pattern work?
Here's also a link to RegEx101 for testing:
https://regex101.com/r/wG3aM3/817
You could make the pattern a bit more specific and make the content of the year group optional:
^(.*?)\s+((?:\d{4}(?:-(?:\d{4}|\d{2}))?)?)\s+([wq][0-9]+)\s*(\.\w+)$
Explanation
^ Start of string
(.*?) Capture group 1, match 0+ times any char except a newline, non-greedy
\s+ Match 1+ whitespace chars
( Capture group 2
(?: Non capture group
\d{4}(?:-(?:\d{4}|\d{2}))? Match 4 digits and optionally - and 2 or 4 digits
)? Close non capture group and make the year optional
) Close group 2
\s+ Match 1+ whitespace chars
([wq][0-9]+) Capture group 3 Match either w or q and 1+ digits 0-9
\s* Match 0+ whitespace chars
(\.\w+) Capture group 4, match a dot and 1+ word characters
$ End of string
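Here is a sketch that runs the pattern over the excerpt from the question in Python. One observation while testing it: the pattern requires whitespace on both sides of the (possibly empty) year group, so a name without a year seems to match only when at least two whitespace characters precede the suffix. VARIANT is an assumed adjustment, not part of the answer, that moves the separator inside an optional group; split_name is a hypothetical helper.

```python
import re

# Pattern from the answer above, unchanged.
ANSWER = re.compile(
    r"^(.*?)\s+((?:\d{4}(?:-(?:\d{4}|\d{2}))?)?)\s+([wq][0-9]+)\s*(\.\w+)$"
)

# Assumed variant (not from the answer): the whitespace after the year sits
# inside the optional group, so a missing year needs only one separator.
VARIANT = re.compile(
    r"^(.*?)\s+(?:(\d{4}(?:-(?:\d{4}|\d{2}))?)\s+)?([wq][0-9]+)\s*(\.\w+)$"
)

names = [
    "test test 1970-2020 w15.txt",
    "test 1970-2020 w15.csv",
    "test 1990-99 q1 .txt",
    "test 1981 w15 .csv",
    "test test w15.csv",
]

def split_name(pattern, name):
    """Return the captured groups as a tuple, or None when there is no match."""
    match = pattern.match(name)
    return match.groups() if match else None

for name in names:
    print(repr(name), split_name(ANSWER, name), split_name(VARIANT, name))
```

With VARIANT, the year group is None rather than an empty string when no year is present.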
Note that \s could also match a newline.
I'm struggling with this one. I want to capture the content of parentheses unless it contains only digits and an optional %. This means I want to capture (essiccato, ricco di flavonoidi) or (ricco di 23% pollo, in parte essiccato, in parte idrolizzato), but not (23 %), (23) or (23 %).
Here is an example: https://regex101.com/r/yW4aZ3/896
So far I have: \([^()][^()]*\)
You may use
r'\((?!\s*\d+(?:[.,]\d+)?\s*)[^()]+\)'
See the regex demo and the regex graph:
Details
\( - a ( char
(?!\s*\d+(?:[.,]\d+)?\s*) - a negative lookahead that matches a location not immediately followed with
\s* - 0+ whitespaces
\d+ - 1+ digits
(?:[.,]\d+)? - an optional occurrence of . or , and 1+ digits
\s* - 0+ whitespaces
[^()]+ - 1+ chars other than ( and )
\) - a ) char.
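A minimal Python sketch of this pattern, applied to the parenthesised samples from the question joined into one string. The name NOT_NUMERIC is an assumption for the example.

```python
import re

# Pattern from the answer above.
NOT_NUMERIC = re.compile(r"\((?!\s*\d+(?:[.,]\d+)?\s*)[^()]+\)")

text = (
    "(essiccato, ricco di flavonoidi) (23 %) (23) "
    "(ricco di 23% pollo, in parte essiccato, in parte idrolizzato)"
)

for found in NOT_NUMERIC.findall(text):
    print(found)
```

Since the lookahead does not require the closing parenthesis, it also appears to skip a group that merely starts with a number, such as a made-up (23% pollo).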
You might use a negative lookahead to assert that what follows the opening parenthesis is not just digits followed by an optional percent sign:
\((?!\s*\d+\s*%?\s*\))[^)]+\)
Explanation
\( Match (
(?! Negative lookahead, assert what is on the right is not
\s*\d+\s*%?\s*\) Match optional whitespace, 1+ digits and an optional %, up to the closing )
) Close lookahead
[^)]+\) Match 1+ times any char except ), then match )
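The following Python sketch applies this pattern to the samples from the question. NOT_ONLY_NUMBER is a made-up name, and the (23% pollo) input is invented to show that the lookahead only rejects groups that contain nothing but the number.

```python
import re

# Pattern from the answer above.
NOT_ONLY_NUMBER = re.compile(r"\((?!\s*\d+\s*%?\s*\))[^)]+\)")

text = (
    "(essiccato, ricco di flavonoidi) (23 %) (23) "
    "(ricco di 23% pollo, in parte essiccato, in parte idrolizzato)"
)

print(NOT_ONLY_NUMBER.findall(text))
# Invented input: starts with a number but contains more than the number.
print(NOT_ONLY_NUMBER.findall("(23% pollo)"))
```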
Assuming that (...) are all balanced and there is no escaping of parentheses inside, you may use this regex with a character class and 2 negated character classes:
\([\d%]*[^%\d()][^()]*\)
RegEx Details
\(: Match opening (
[\d%]*: Match 0 or more characters that are either a digit or %
[^%\d()]: Match a character that is not (, ), % or a digit
[^()]*: Match 0 or more of any characters that are not ( and not a )
\): Match closing )
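A short Python sketch of the character-class approach. HAS_OTHER_CHAR is a hypothetical name, and the numeric samples are written without spaces here; as far as I can tell, a space counts as "not a digit or %", so a group like (23 %) with a space would be matched by this pattern.

```python
import re

# Pattern from the answer above.
HAS_OTHER_CHAR = re.compile(r"\([\d%]*[^%\d()][^()]*\)")

text = (
    "(essiccato, ricco di flavonoidi) (23%) (23) "
    "(ricco di 23% pollo, in parte essiccato, in parte idrolizzato)"
)

print(HAS_OTHER_CHAR.findall(text))
# The space inside the parentheses satisfies [^%\d()], so this one matches.
print(HAS_OTHER_CHAR.findall("(23 %)"))
```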
I have to collect two pieces of information from a text using regex: the name and the database, so that I can relate them in one table. But I can only collect them individually.
This is an example. I have many blocks like this, and two of them don't have a database value; those I need to ignore.
[SCD] {I need the name between []}
Driver=/opt/pcenter/pc961/ODBC7.1/lib/DWmsss27.so
Description=
Database=scd {I need the value after Defaut|Database}
Address=#######
LogonID=######
Password=######
QuoteId=No
AnsiNPW=No
ApplicationsUsingThreads=1
The regex to find the name is:
(?<=\[)(.*)(?=\])
The regex to find the value after database is
(?<=Defaut|Database=)(.*)
How can I combine both of them into one regex?
To match both values, you could use 2 capturing groups instead, with a repeating pattern and a negative lookahead that skips every line that does not start with Default or Database until one does.
\[([^]]+)\](?:\r?\n(?!Default|Database).*)*\r?\n(?:Default|Database)=(\S+)
About the pattern
\[ Match [
( Capture group 1
[^]]+ match 1+ times not ]
) Close group 1
\] Match ]
(?: Non capturing group
\r?\n Match newline,
(?! Negative lookahead, assert what is directly on the right is not
Default|Database Match one of the options
).* Close negative lookahead and match any char except a newline 0+ times
)* Close non capturing group and repeat 0+ times
\r?\n(?:Default|Database)= Match newline, any of the options and =
(\S+) Capturing group 2, match 1+ times a non whitespace char (or use (.+) to match any char 1+ times)
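Below is a Python sketch of the pattern on a shortened config. The [NODB] and [OTHER] blocks are made up for the example. One thing that shows up when running it: the line-skipping loop does not stop at the next [ header, so a block without a Database line appears to borrow the value of the block after it. GUARDED is an assumed variant, not part of the answer, that adds \[ to the lookahead to avoid that.

```python
import re

# Pattern from the answer above, unchanged.
ANSWER = re.compile(
    r"\[([^]]+)\](?:\r?\n(?!Default|Database).*)*\r?\n(?:Default|Database)=(\S+)"
)

# Assumed variant (not from the answer): also stop skipping at the next [header].
GUARDED = re.compile(
    r"\[([^]]+)\](?:\r?\n(?!Default|Database|\[).*)*\r?\n(?:Default|Database)=(\S+)"
)

config = "\n".join([
    "[SCD]",
    "Driver=/opt/pcenter/pc961/ODBC7.1/lib/DWmsss27.so",
    "Description=",
    "Database=scd",
    "QuoteId=No",
    "[NODB]",        # made-up block without a Database line
    "Description=",
    "[OTHER]",       # made-up block that uses Default=
    "Default=main",
    "QuoteId=No",
])

print(ANSWER.findall(config))
print(GUARDED.findall(config))
```

Each result is a list of (name, database) tuples, which maps directly onto a two-column table.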
I want to get two words from a string, one before "__" and the other after.
For example:
'?o=-1' # Skip it
'?client__name=Client1&o=-1' # Should return client__name
'?o=-1&product__name=Product1+Test1' # Should return product__name
The closest I got:
after: (?:__).*[a-z]
before: (\S+?)__
I'm trying to use it in Python.
You may use
[^\W_]+__[^\W_]+
See the regex demo and its graph:
Details
[^\W_]+ - 1 or more letters or digits
__ - a __ substring
[^\W_]+ - 1 or more letters or digits.
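A minimal Python sketch of this pattern on the three query strings from the question. FIELD and find_field are hypothetical names used only here.

```python
import re

# Pattern from the answer above.
FIELD = re.compile(r"[^\W_]+__[^\W_]+")

def find_field(query):
    """Hypothetical helper: return the first word__word token, or None."""
    match = FIELD.search(query)
    return match.group(0) if match else None

for query in (
    "?o=-1",
    "?client__name=Client1&o=-1",
    "?o=-1&product__name=Product1+Test1",
):
    print(query, "->", find_field(query))
```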
You might use a capturing group:
[&?]([^\W_]+__[^\W_]+)=
That will match:
[&?] Match either & or ?
( Capturing group
[^\W_]+__[^\W_]+ Match 1+ times a word char except an underscore, then __ and again 1+ times a word char except an underscore
) Close capturing group
= Match = literally
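A Python sketch of the capturing-group version. PARAM and find_param are made-up names, and the last query string is invented to show that anchoring on [&?] and = restricts the match to parameter names rather than values.

```python
import re

# Pattern from the answer above: group 1 is the parameter name.
PARAM = re.compile(r"[&?]([^\W_]+__[^\W_]+)=")

def find_param(query):
    """Hypothetical helper: return the first word__word parameter name, or None."""
    match = PARAM.search(query)
    return match.group(1) if match else None

for query in (
    "?o=-1",
    "?client__name=Client1&o=-1",
    "?o=-1&product__name=Product1+Test1",
    "?o=a__b",  # invented: the double underscore is in the value, not the name
):
    print(query, "->", find_param(query))
```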