Regex for "replace all after first 2 digits after comma" - regex

I would like to replace all characters after the first 2 digits after a comma.
E.g. having a string of 1234,56789 should result into 1234,56.
Using [^,]*$ has led me to the right path, but deleting everything after the comma.
A [^,]..$ doesnt give me a correct result too, thus I need a way to tell my expression that "the first 2 digits after the comma" got to be deleted, not "the last 2 digits" since thats what the ".." seems to do in my expression.

You can use
(,\d{2}).*
The regex matches and captures into Group 1 a comma and two digits, and just matches the rest of the line with .*.
To remove only after last comma:
(.*,\d{2}).*
Here, .* at the start captures also everything at the start of the string.
A more retrictive pattern will be
^(\d+,\d{2})\d*$
It matches start of string (with ^), then one or more digits (with \d+), a comma, two digits, all captured into Group 1, and then just matches zero or more digits (with \d*) at the end of the string ($).
Replace with $1 (or \1 depending on the regex engine). See the regex demo (also this one and this one, too).

You can use:
import re
re.sub(r',(\d{2}).*', r',\1', a)

Related

Regex Help required for User-Agent Matching

Have used an online regex learning site (regexr) and created something that works but with my very limited experience with regex creation, I could do with some help/advice.
In IIS10 logs, there is a list for time, date... but I am only interested in the cs(User-Agent) field.
My Regex:
(scan\-\d+)(?:\w)+\.shadowserver\.org
which matches these:
scan-02.shadowserver.org
scan-15n.shadowserver.org
scan-42o.shadowserver.org
scan-42j.shadowserver.org
scan-42b.shadowserver.org
scan-47m.shadowserver.org
scan-47a.shadowserver.org
scan-47c.shadowserver.org
scan-42a.shadowserver.org
scan-42n.shadowserver.org
scan-42o.shadowserver.org
but what I would like it to do is:
Match a single number with the option of capturing more than one: scan-2 or scan-02 with an optional letter: scan-2j or scan-02f
Append the rest of the User Agent: .shadowserver.org to the regex.
I will then add it to an existing URL Rewrite rule (as a condition) to abort the request.
Any advice/help would be very much appreciated
Tried:
To write a regex for IIS10 to block requests from a certain user-agent
Expected:
It to work on single numbers as well as double/triple numbers with or without a letter.
(scan\-\d+)(?:\w)+\.shadowserver\.org
Input Text:
scan-2.shadowserver.org
scan-02.shadowserver.org
scan-2j.shadowserver.org
scan-02j.shadowserver.org
scan-17w.shadowserver.org
scan-101p.shadowserver.org
UPDATE:
I eventually came up with this:
scan\-[0-9]+[a-z]{0,1}\.shadowserver\.org
This is explanation of your regex pattern if you only want the solution, then go directly to the end.
(scan\-\d+)(?:\w)+
(scan\-\d+) Group1: match the word scan followed by a literal -, you escaped the hyphen with a \, but if you keep it without escaping it also means a literal - in this case, so you don't have to escape it here, the - followed by \d+ which means one more digit from 0-9 there must be at least one digit, then the value inside the group will be saved inside the first capturing group.
(?:\w)+ non-capturing group, \w one character which is equal to [A-Za-z0-9_], but the the plus + sign after the non-capturing group (?:\w)+, means match the whole group one or more times, the group contains only \w which means it will match one or more word character, note the non-capturing group here is redundant and we can use \w+ directly in this case.
Taking two examples:
The first example: scan-02.shadowserver.org
(scan\-\d+)(?:\w)+
scan will match the word scan in scan-02 and the \- will match the hyphen after scan scan-, the \d+ which means match one or more digit at first it will match the 02 after scan- and the value would be scan-02, then the (?:\w)+ part, the plus + means match one or more word character, at least match one, it will try to match the period . but it will fail, because the period . is not a word character, at this point, do you think it is over ? No , the regex engine will return back to the previous \d+, and this time it will only match the 0 in scan-02, and the value scan-0 will be saved inside the first capturing group, then the (?:\w)+ part will match the 2 in scan-02, but why the engine returns back to \d+ ? this is because you used the + sign after \d+, (?:\w)+ which means match at least one digit, and one word character respectively, so it will try to do what it is asked to do literally.
The second example: scan-2.shadowserver.org
(scan\-\d+)(?:\w)+
(scan\-\d+) will match scan-2, (?:\w)+ will try to match the period after scan-2 but it fails and this is the important point here, then it will go back to the beginning of the string scan-2.shadowserver.org and try to match (scan\-\d+) again but starting from the character c in the string , so s in (scan\-\d+) faild to match c, and it will continue trying, at the end it will fail.
Simple solution:
(scan-\d+[a-z]?)\.shadowserver\.org
Explanation
(scan-\d+[a-z]?), Group1: will capture the word scan, followed by a literal -, followed by \d+ one or more digits, followed by an optional small letter [a-z]? the ? make the [a-z] part optional, if not used, then the [a-z] means that there must be only one small letter.
See regex demo

How to extract a word that could possibly be followed with another word

I want to extract [games, games, things, things] from
the following array.
Today_games
Today_games_freq
Today_things
Today_things_freq
I have tried Today_(\w+)(?=_freq)?
Which will give me the extra "freq"
And some other combinations, but I couldn't figure out how to get just after the first hyphen.
You can use
Today_(\w+?)(?:_freq)?$
See the regex demo. This matches Today_, then captures any one or more word chars (as few as possible) into Group 1 (with (\w+?)), and then (?:_freq)?$ matches an optional occurrence of a _freq substring and asserts the position at the end of string.
Or,
Today_([^\W_]+)
See this regex demo.
Here, Today_ is matched and the ([^\W_]+) pattern captures one or more alphanumeric chars into Group 1 (same as \w+ with _ subtracted from \w).

regex match two words based on a matching substring

there are 4 strings as shown below
ABC_FIXED_20220720_VALUEABC.csv
ABC_FIXED_20220720_VALUEABCQUERY_answer.csv
ABC_FIXED_20220720_VALUEDEF.csv
ABC_FIXED_20220720_VALUEDEFQUERY_answer.csv
Two strings are considered as matched based on a matching substring value (VALUEABC, VALUEDEF in the above shown strings). Thus I am looking to match first 2 (having VALUEABC) and then next 2 (having VALUEDEF). The matched strings are identified based on the same value returned for one regex group.
What I tried so far
ABC.*[0-9]{8}_(.*[^QUERY_answer])(?:QUERY_answer)?.csv
This returns regex group-1 (from (.*[^QUERY_answer])) value "VALUEABC" for first 2 strings and "VALUEDEF" for next 2 strings and thus desired matching achieved.
But the problem with above regex is that as soon as the value ends with any of the characters of "QUERY_answer", the regex doesn't match any value for the grouping. For instance, the below 2 strings doesn't match at all as the VALUESTU ends with "U" here :
ABC_FIXED_20220720_VALUESTU.csv
ABC_FIXED_20220720_VALUESTUQUERY_answer.csv
I tried to use Negative Lookahead:
ABC.*[0-9]{8}_(.*(?!QUERY_answer))(?:QUERY_answer)?.csv
but in this case the grouping-1 value is returned as "VALUESTU" for first string and "VALUESTUQUERY_answer" for second string, thus effectively making the 2 strings unmatched.
Any way to achieve the desired matching?
With your shown samples please try following regex.
^ABC_[^_]*_[0-9]+_(.*?)(?:QUERY_answer)?\.csv$
OR to match exact 8 digits try:
^ABC_[^_]*_[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv$
Here is the online demo for above regex.
Explanation: Adding detailed explanation for above regex.
^ABC_[^_]*_ ##Matching from starting of value ABC followed by _ till next occurrence of _.
[0-9]+_ ##Matching continuous occurrences of digits followed by _ here.
(.*?) ##Creating one and only capturing group using lazy match which is opposite of greedy match.
(?:QUERY_answer)? ##In a non-capturing group matching QUERY_answer and keeping it optional.
\.csv$ ##Matching dot literal csv at the end of the value.
You need
ABC.*[0-9]{8}_(.*?)(?:QUERY_answer)?\.csv
See the regex demo.
Note
.*[^QUERY_answer] matches any zero or more chars other than line break chars as many as possible, and then any one char other than Q, U, E, etc., i.e. any char in the negated character class. This is replaced with .*?, to match any zero or more chars other than line break chars as few as possible.
(?:QUERY_answer)? - the group is made non-capturing to reduce grouping complexity.
\.csv - the . is escaped to match a literal dot.

Trying to match zero outside the word bounderies

I have patterns like
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
I can match word TCELL and TBNK with this RegEX
^(\D+)-(\d+)-(\d+)([A-Z1-9]+)?.*
But if I have patterns like
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
the above regex returns
T2 and C192 instead of T20NK and C1920 respectively
Is there a general regex that matches Nzeros out side of these word boundaries?
Let's consider all 4 examples of your input:
FQC19515_TCELL001_20190319_165944.pdf
FQC19515_TBNK001_20190319_165944.pdf
FLW194640_T20NK022_20190323_131348.pdf
FLW194228_C1920_SOME_DEBRIS_REMOVED.pdf
The first group, between start of line and the first "_" (e.g. FQC19515 in row 1)
consists of:
a non-empty sequence of letters,
a non-empty sequence of digits.
So the regex matching it, including the start of line anchor and a capturing group is:
^([A-Z]+\d+)
You used \D instead of [A-Z] but I think that [A-Z] is
more specific, as it matches only letters an not e.g. "_".
The next source char is _, so the regex can also include _.
A now the more diificult part: The second group to be captured has
actually 2 variants:
a sequence of letters and a sequence of digits (after that there is
a "_"),
a sequence of letters, a sequence of digits and another sequence of
letters (after that there are digits that you want to omit).
So the most intuitive way is to define 2 alternatives, each with
a respective positive lookahead:
alternative 1: [A-Z]+\d+(?=_),
alternative 2: [A-Z]+\d+[A-Z]+(?=\d).
But there is a bit shorter way. Notice that both alternatives start
from [A-Z]+\d+.
So we can put this fragment at the first place and only the rest
include as a non-capturing group ((?:...)), with 2 alternatives.
All the above should be surrounded with a capturing group:
([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
So the whole regex can be:
^([A-Z]+\d+)_([A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
with m option ("^" matches also the start of each line).
For a working example see https://regex101.com/r/GDdt10/1
Your regex: ^(\D+)-(\d+) is wrong as after a sequence of non-digits
(\D+) you specified a minus which doesn't occur in your source.
Also the second minus does not correspond to your input.
Edit
To match all your strings, I modified slightly the previous regex.
The changes are limited to the matching group No 2 (after _):
Alternative No 1: [A-Z]{2,}+(?=\d) - two or more letters, after them
there is a digit, to be omitted. It will match TCELL and TBNK.
Alternative No 2: [A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)) - the previous
content of this group. It will match two remaining cases.
So the whole regex is:
^([A-Z]+\d+)_([A-Z]{2,}+(?=\d)|[A-Z]+\d+(?:(?=_)|[A-Z]+(?=\d)))
For a working example see https://regex101.com/r/GDdt10/2
As far as I understand, you could use:
^[A-Z]+\d+_\K[A-Z0-9]{5}
Explanation:
^ # beginning of line
[A-Z]+ # 1 or more capitals
\d+_ # 1 or more digit and 1 underscore
\K # forget all we have seen until this position
[A-Z0-9]{5} # 5 capitals or digits
Demo

changing RegEx from 3 digits to 4

I'm not that great at RegEx, and have the following piece of code on my hands:
value.replace(/\s*.*(\d+[,\.]\d+)[^\d]*/m, "$1");
Now it works great at reducing this "\r\n\t\t\t\t& #36;0.05 USD\t\t\t" (please note I've intentionally left a space between the & and # as removing it converts it to a dollar sign on the site) to this "0.05". The issue I have is that if the number is a double digit (10.05 rather than 0.05) the expression removes the digit from the front and still outputs 0.05 rather than 10.05.
From what I can see in the expression, it's hard coded to pick up just 3 digits, so I was wondering if there's a way to amend it to also work in cases where there are 4 digits.
The . after /\s* is matching the first digit if there are 2 or more digits. Remove that and see if it works...
value.replace(/\s*(\d+[,.]\d+)[^\d]/m, "$1");
Given your example of the regex:
/\s*.*(\d+[,.]\d+)[^\d]/m
And the data:
\r\n\t\t\t\t$0.05 USD\t\t\t
\r\n\t\t\t\t$10.05 USD\t\t\t
In the regex, the leading "/" (forward-slash), and the "/" before the "m" delimits the regex and is not part of the matching.
The "\s" in the regex is shorthand for [ \t\r\n\f] which matches whitespace (space, tab, Carriage-return, Line-feed, Form-feed). So, "\s*" will match "\r\n\t\t\t\t"
The "." (dot) in the regex matches any single character (generally any character except "\n").
The "*" following the "." says to match any 0 or more characters. So, together the ".*", matches the "$" (and possibly, additionally, one or more digits... see below).
Next, the "(" in the regex starts the part of the regex that will "capture" part of your data.
The "\d" in the regex will match any 1 number. Actually "\d" matches [0-9] and other digit characters, like Eastern Arabic numerals "??????????".
The "+" following the "\d" says to match any 1 or more numbers (digits).
The "[,.]" in the regex will match one of either a literal "." (dot), or a "," (comma), to match the "decimal" separator.
Another "\d+" to match any 1 or more numbers (digits).
Next, the ")" in the regex closes the part of the regex that will "capture" part of your data.
The "[^\d]" will match any 1 character that is not a number (digit). So, in this case, it will match the
" " (space).
The "m" at the end of the regex (following the second "/"): "m" changes the behavior of the "^" and "$" anchors, which are not used in your regex, so the "m" should have no effect. But, if you're using Ruby, "m" changes the behavior of the "." (dot).
Now, the "problem"... the ".*" (before the "("), is in regex terms, "greedy". This means it will match as "early" as possible, and for as "long" as possible. So, if there is more than 1 digit following the ";", then the ".*" will consume some digits.
Note: Using ".*" can cause all sorts of problems, especially with "/m" under Ruby. It's best to avoid using ".*" if possible.
There are 2 ways to fix this.
1) If the part before the number you want to capture is always "$", then specify that in regex instead of the ".*". So like this:
/\s*$(\d+[,.]\d+)[^\d]/m
or, if it will always be "$" or something very similar to that:
/\s*[^;]+;(\d+[,.]\d+)[^\d]/m
Here, "[^;]+;" means any string of 1 or more characters that does not contain a ";" followed by a "[;]".
2) If the part before the number you want to capture which is shown as "$", could be totally different in the data, then you just need to make sure that the part of the regex that is currently ".*" will not match a digit in the last position. So like this:
/\s[^.,]*[^\d](\d+[,.]\d+)[^\d]/m
Here, "[^.,]*[^\d]" means any string of 0 or more characters that does not contain a "." (dot) or a "," (comma) where the last character does not contain a digit.
Try this
value.replace( /\s*.(\d+[,.]\d+)[^\d]/m, "$1");
WORKING REGEX
Output:
The .* matches greedily and therefore matches as many characters, including digits, as it can, as long as the rest of the pattern can still match.
The rest of the pattern can still match if just one digit is left for the /d+ to match, so you only end up with one digit there.
If the semicolon in your example is always in that position in the strings you wish to match, use it as a marker like this
value.replace(/.*;(\d+[,\.]\d+).*/m, "$1");