I have 2 strings
1) abc-def
2) abc-
I have written the regex group (?<Myid>[a-zA-Z0-9-]+), which works fine for the first string.
However, in the 2nd string I don't need the "-"; only abc should be selected. How can I add a condition for this?
I would phrase your regex as:
(?<Myid>[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*)
This pattern says to match:
[a-zA-Z0-9]+ match one or more alphanumeric characters
(?:-[a-zA-Z0-9]+)* followed by dash and more alphanumeric characters,
zero or more times
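As a quick check, here is a minimal Python sketch of the pattern above. Note that Python's re module spells named groups (?P<Myid>...) rather than (?<Myid>...); the helper name extract_id is only an illustration and not part of the original answer.

```python
import re

# Same pattern as above, written with Python's (?P<name>...) named-group syntax.
MYID = re.compile(r'(?P<Myid>[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*)')

def extract_id(text):
    """Return the Myid group of the first match, or None if nothing matches."""
    m = MYID.search(text)
    return m.group('Myid') if m else None

print(extract_id('abc-def'))  # abc-def
print(extract_id('abc-'))     # abc
```

A dash is only consumed when alphanumeric characters follow it, which is why the trailing dash of the second string is left out.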
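Appending a negated character class at the end should suffice here, I guess,
i.e. (?<Myid>[a-zA-Z0-9-]+[^-]). Note that [^-] matches any character other than a dash (including whitespace), and that this pattern needs at least two characters to match.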
Demo: https://regex101.com/r/PetK6Q/1
there are 4 strings as shown below
ABC_FIXED_20220720_VALUEABC.csv
ABC_FIXED_20220720_VALUEABCQUERY_answer.csv
ABC_FIXED_20220720_VALUEDEF.csv
ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv
Two strings are considered as matched based on a matching substring value (VALUEABC, VALUEDEF in the above shown strings). Thus I am looking to match first 2 (having VALUEABC) and then next 2 (having VALUEDEF). The matched strings are identified based on the same value returned for one regex group.
What I tried so far
ABC.*[0-9]{8}_(.*[^QUERY_answer])(?:QUERY_answer)?.csv
This returns regex group-1 (from (.*[^QUERY_answer])) value "VALUEABC" for first 2 strings and "VALUEDEF" for next 2 strings and thus desired matching achieved.
But the problem with the above regex is that as soon as the value ends with any of the characters of "QUERY_answer", the regex doesn't match at all. For instance, the 2 strings below don't match, because VALUESTU ends with "U":
ABC_FIXED_20220720_VALUESTU.csv
ABC_FIXED_20220720_VALUESTUQUERY_answer.csv
I tried to use Negative Lookahead:
ABC.*[0-9]{8}_(.*(?!QUERY_answer))(?:QUERY_answer)?.csv
but in this case the grouping-1 value is returned as "VALUESTU" for first string and "VALUESTUQUERY_answer" for second string, thus effectively making the 2 strings unmatched.
Any way to achieve the desired matching?
With your shown samples please try following regex.
^ABC_[^_]*_[0-9]+_(.*?)(?:QUERY_answer)?\.csv$
OR to match exact 8 digits try:
^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$
Here is the online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^ABC_[^_]*_ ##Matching from starting of value ABC followed by _ till next occurrence of _.
[0-9]+_ ##Matching continuous occurrences of digits followed by _ here.
(.*?) ##Creating one and only capturing group using lazy match which is opposite of greedy match.
(?:QUERY_answer)? ##In a non-capturing group matching QUERY_answer and keeping it optional.
\.csv$ ##Matching dot literal csv at the end of the value.
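To illustrate how the capturing group can serve as the matching key, here is a small Python sketch that groups file names by group 1 of the 8-digit variant of the regex. The function name group_by_value and the use of a dictionary are assumptions made for the example, not something the question prescribes.

```python
import re
from collections import defaultdict

# The 8-digit variant of the regex from the answer above.
PATTERN = re.compile(r'^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$')

def group_by_value(names):
    """Map each captured value (group 1) to the file names that share it."""
    groups = defaultdict(list)
    for name in names:
        m = PATTERN.match(name)
        if m:  # names that do not match the pattern are skipped
            groups[m.group(1)].append(name)
    return dict(groups)

files = [
    'ABC_FIXED_20220720_VALUEABC.csv',
    'ABC_FIXED_20220720_VALUEABCQUERY_answer.csv',
    'ABC_FIXED_20220720_VALUEDEF.csv',
    'ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv',
    'ABC_FIXED_20220720_VALUESTU.csv',
    'ABC_FIXED_20220720_VALUESTUQUERY_answer.csv',
]
for value, members in group_by_value(files).items():
    print(value, members)
```

Each value, including VALUESTU, ends up with exactly two file names.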
You need
ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv
See the regex demo.
Note
.*[^QUERY_answer] matches any zero or more chars other than line break chars as many as possible, and then any one char other than Q, U, E, etc., i.e. any char in the negated character class. This is replaced with .*?, to match any zero or more chars other than line break chars as few as possible.
(?:QUERY_answer)? - the group is made non-capturing to reduce grouping complexity.
\.csv - the . is escaped to match a literal dot.
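The difference between the negated character class and the lazy quantifier can be checked with a short Python sketch. The names ORIGINAL, FIXED and first_group are made up for this comparison.

```python
import re

# The pattern from the question: [^QUERY_answer] is a character class, not a "not this word" test.
ORIGINAL = re.compile(r'ABC.*[0-9]{8}_(.*[^QUERY_answer])(?:QUERY_answer)?.csv')
# The pattern from this answer: lazy .*? plus an escaped dot.
FIXED = re.compile(r'ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv')

def first_group(pattern, name):
    """Return group 1 of the first match, or None if the pattern does not match."""
    m = pattern.search(name)
    return m.group(1) if m else None

for name in ('ABC_FIXED_20220720_VALUESTU.csv',
             'ABC_FIXED_20220720_VALUESTUQUERY_answer.csv'):
    print(name, first_group(ORIGINAL, name), first_group(FIXED, name))
```

With the original pattern both VALUESTU names fail to match, while the lazy version captures VALUESTU for both.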
I'm trying to search for colons in a given string so as to split the string at the colon for preprocessing based on the following conditions
1. Preceded or followed by a word e.g A Book: Chapter 1 or A Book :Chapter 1
2. Do not match if it is part of emoticons i.e :( or ): or :/ or :-) etc
3. Do not match if it is part of a given time i.e 16:00 etc
I've come up with a regex as such
(\:)(?=\w)|(?<=\w)(\:)
which satisfies conditions 1 & 2 but still fails on condition 3, as it matches the colon present in the string representation of a time. How do I fix this?
edit: it has to be in a single regex statement if possible
You can use
(:\b|\b:)(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b)
See the regex demo. Details:
(:\b|\b:) - Group 1: a : that is either preceded or followed with a word char
(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b) - there should be no one or two digits right after : (followed with a word boundary) if the : is preceded with a single or two digits (preceded with a word boundary).
Note :\b is equal to :(?=\w) and \b: is equal to (?<=\w):.
If you need to get the same capturing groups as in your original pattern, replace (:\b|\b:) with (?:(:)\b|\b(:)).
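Here is a minimal Python sketch of the pattern, assuming Python's re module (both lookbehinds are fixed-width, so it accepts them). The helper name split_colon_positions is hypothetical; it just reports where a splitting colon was found.

```python
import re

# The single-regex solution from above.
COLON = re.compile(r'(:\b|\b:)(?!(?:(?<=\b\d:)|(?<=\b\d{2}:))\d{1,2}\b)')

def split_colon_positions(text):
    """Return the index of every colon that qualifies as a split point."""
    return [m.start() for m in COLON.finditer(text)]

print(split_colon_positions('A Book: Chapter 1'))  # [6]
print(split_colon_positions('A Book :Chapter 1'))  # [7]
print(split_colon_positions('Lunch at 16:00 :)'))  # []
```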
More flexible solution
Note that excluding matches can be done with a simpler pattern that matches and captures what you need and just matches what you do not need. This is called "best regex trick ever". So, you may use a regex like
8:|:[PD]|\d+(?::\d+)+|(:\b|\b:)
that will match 8:, :P, :D, one or more digits and then one or more sequences of : and one or more digits, or will match and capture into Group 1 a : char that is either preceded or followed with a word char. All you need to do is to check if Group 1 matched, and implement required extraction/replacement logic in the code.
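The "check if Group 1 matched" step could look like this in Python. This is only a sketch; the helper name wanted_colons is an assumption, and the real extraction or replacement logic depends on your code.

```python
import re

# Alternatives without a capturing group "eat" the unwanted cases;
# only the last alternative captures into Group 1.
TRICK = re.compile(r'8:|:[PD]|\d+(?::\d+)+|(:\b|\b:)')

def wanted_colons(text):
    """Return positions of colons captured in Group 1, ignoring all other matches."""
    return [m.start(1) for m in TRICK.finditer(text) if m.group(1) is not None]

print(wanted_colons('Note: call at 16:00 :D'))  # [4]
```

Here 16:00 and :D are matched by the earlier alternatives and discarded, so only the colon after Note is reported.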
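Word characters (\w) include digits and the underscore, i.e. [a-zA-Z0-9_], so the colon in a time such as 16:00 still counts as being next to a word character.
So just use [a-zA-Z] instead: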
(\:)(?=[a-zA-Z])|(?<=[a-zA-Z])(\:)
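A short Python check of this letters-only variant follows; the helper name has_split_colon is made up for the example. Keep in mind that an emoticon containing a letter, such as :D, would still match with this pattern.

```python
import re

# Letters-only variant: digits next to the colon no longer count.
LETTER_COLON = re.compile(r'(\:)(?=[a-zA-Z])|(?<=[a-zA-Z])(\:)')

def has_split_colon(text):
    """True if the text contains a colon directly next to an ASCII letter."""
    return LETTER_COLON.search(text) is not None

print(has_split_colon('A Book: Chapter 1'))  # True
print(has_split_colon('16:00'))              # False
print(has_split_colon(':('))                 # False
```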
I would like to replace all characters after the first 2 digits after a comma.
E.g. having a string of 1234,56789 should result into 1234,56.
Using [^,]*$ has put me on the right path, but it deletes everything after the comma.
[^,]..$ doesn't give me the correct result either, so I need a way to tell my expression to keep "the first 2 digits after the comma" and delete the rest, rather than target "the last 2 digits", which is what the ".." seems to do in my expression.
You can use
(,\d{2}).*
The regex matches and captures into Group 1 a comma and two digits, and just matches the rest of the line with .*.
To remove only after last comma:
(.*,\d{2}).*
Here, .* at the start captures also everything at the start of the string.
A more restrictive pattern will be
^(\d+,\d{2})\d*$
It matches start of string (with ^), then one or more digits (with \d+), a comma, two digits, all captured into Group 1, and then just matches zero or more digits (with \d*) at the end of the string ($).
Replace with $1 (or \1 depending on the regex engine). See the regex demo (also this one and this one, too).
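The three patterns behave differently once the input contains more than one comma. A small Python sketch (where the backreference is written \1; the function names are only illustrative) makes that visible:

```python
import re

def truncate_first(s):
    """Keep the text up to the first comma plus two digits."""
    return re.sub(r'(,\d{2}).*', r'\1', s)

def truncate_last(s):
    """Keep the text up to the last comma that is followed by two digits."""
    return re.sub(r'(.*,\d{2}).*', r'\1', s)

def truncate_strict(s):
    """Only touch strings of the exact form digits,digits."""
    return re.sub(r'^(\d+,\d{2})\d*$', r'\1', s)

print(truncate_first('1234,56789'))    # 1234,56
print(truncate_first('1,234,56789'))   # 1,23
print(truncate_last('1,234,56789'))    # 1,234,56
print(truncate_strict('1,234,56789'))  # 1,234,56789 (no match, unchanged)
```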
You can use:
import re
a = '1234,56789'  # sample input from the question
print(re.sub(r',(\d{2}).*', r',\1', a))  # prints 1234,56
I have strings of the following form:
en-US //return en
en-UK //return en
en //don't return
nl-NL //return nl
nl-BE //return nl
nl //don't return
I would like to return the ones indicated in the comments above. I tried .*\- but this returns en-. How do I stop matching before the dash, so that only en is returned? I'm testing it here.
One option is to use a capturing group at the start of the string for the first 2 lowercase chars and then match the following dash and the 2 uppercase chars.
^([a-z]{2})-[A-Z]{2}$
If you want to capture multiple chars [a-z] (or any character except a hyphen or newline, [^-\r\n]) before the dash and then match the dash, you could use a quantifier like + to match 1+ times or use {2,} to match 2 or more times.
^([a-z]{2,})-
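Applied to the sample values from the question, the first (stricter) pattern could be used like this in Python. The helper name language_of is an assumption for the sketch.

```python
import re

# Two lowercase letters captured, then a dash and two uppercase letters.
LOCALE = re.compile(r'^([a-z]{2})-[A-Z]{2}$')

def language_of(tag):
    """Return the language part of a tag like en-US, or None for a bare language code."""
    m = LOCALE.match(tag)
    return m.group(1) if m else None

tags = ['en-US', 'en-UK', 'en', 'nl-NL', 'nl-BE', 'nl']
print([language_of(t) for t in tags])  # ['en', 'en', None, 'nl', 'nl', None]
```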
You could use a positive lookahead.
.*(?=-)
If you are always specifically looking for 2 lowercase alpha characters preceding a dash, then it is probably a good idea to be a bit more targeted with your regex.
[a-z]{2}(?=-)
.+?(?=-) as a regular expression should do what you are asking.
Where
. matches any character
+? matches between one and infinity times, but it does it as few times as possible, using lazy expansion
and
(?=-) Is a positive look ahead, so it checks ahead in the string, and only matches and returns if the next character in the string is - but the return will not include the value -
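The lookahead variants can be compared in Python with a tiny helper (before_dash is a hypothetical name). The greedy and lazy forms only differ when the input contains more than one dash.

```python
import re

def before_dash(pattern, text):
    """Return the text matched by the pattern, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(before_dash(r'.+?(?=-)', 'en-US'))       # en
print(before_dash(r'.+?(?=-)', 'nl'))          # None
print(before_dash(r'[a-z]{2}(?=-)', 'nl-BE'))  # nl
print(before_dash(r'.*(?=-)', 'a-b-c'))        # a-b (greedy: up to the last dash)
print(before_dash(r'.+?(?=-)', 'a-b-c'))       # a   (lazy: up to the first dash)
```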
For example, I have the following string:
-----ABC-----
And use
regexp {^[\-]+.*[\-]+} $string
can match the above string. But if I want to match fixed number of -, e.g. 5 times, how to do that? I tried
regexp {^[\-]{5}.*[\-]{5}} $string
But it doesn't work.
The .* part matches the - as well. I would change it to this:
^-{5}[^-]*-{5}$
[^-]* means "zero or more characters other than -". (You don't have to put the - in [] when it is the only character to match.)
Assuming that by does not work you mean that it also matches something like so: -----ABC--------------, you could change this: {^[\-]{5}.*[\-]{5}} to this: {^[\-]{5}[^-]*[\-]{5}$}.
The main difference is that I am specifying that mid section, that is, the section which in your example contains ABC should not be made out of dashes, so it will match 5 dashes from the beginning of the string (^[\-]{5}), followed by 0 or more characters which are not dashes ([^-]*), followed by 5 more dashes and a string termination ([\-]{5}$).
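The question uses Tcl, but the regex syntax involved is the same in Python, so the difference can be sketched there. This assumes, as the answer does, that "doesn't work" means the pattern also accepts extra dashes; the names LOOSE and STRICT are made up for the example.

```python
import re

# Pattern from the question: .* can swallow dashes, and nothing anchors the end.
LOOSE = re.compile(r'^[\-]{5}.*[\-]{5}')
# Pattern from the answers: the middle part may not contain dashes, and $ anchors the end.
STRICT = re.compile(r'^-{5}[^-]*-{5}$')

for s in ('-----ABC-----', '-----ABC--------', '------ABC-----'):
    print(s, bool(LOOSE.search(s)), bool(STRICT.search(s)))
```

Only the first string, with exactly five dashes on each side, is accepted by the strict pattern, while the loose one accepts all three.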