changing RegEx from 3 digits to 4 - regex

I'm not that great at RegEx, and have the following piece of code on my hands:
value.replace(/\s*.*(\d+[,\.]\d+)[^\d]*/m, "$1");
Now it works great at reducing this "\r\n\t\t\t\t& #36;0.05 USD\t\t\t" (please note I've intentionally left a space between the & and # as removing it converts it to a dollar sign on the site) to this "0.05". The issue I have is that if the number is a double digit (10.05 rather than 0.05) the expression removes the digit from the front and still outputs 0.05 rather than 10.05.
From what I can see in the expression, it's hard coded to pick up just 3 digits, so I was wondering if there's a way to amend it to also work in cases where there are 4 digits.

The .* after /\s* is greedy, so it swallows the leading digit(s) when the number has 2 or more digits before the separator. Remove it and see if that works...
value.replace(/\s*(\d+[,.]\d+)[^\d]/m, "$1");

Given your example of the regex:
/\s*.*(\d+[,.]\d+)[^\d]/m
And the data:
\r\n\t\t\t\t&#36;0.05 USD\t\t\t
\r\n\t\t\t\t&#36;10.05 USD\t\t\t
In the regex, the leading "/" (forward-slash), and the "/" before the "m" delimits the regex and is not part of the matching.
The "\s" in the regex is shorthand for [ \t\r\n\f] which matches whitespace (space, tab, Carriage-return, Line-feed, Form-feed). So, "\s*" will match "\r\n\t\t\t\t"
The "." (dot) in the regex matches any single character (generally any character except "\n").
The "*" following the "." says to match any 0 or more characters. So, together the ".*", matches the "&#36;" (and possibly, additionally, one or more digits... see below).
Next, the "(" in the regex starts the part of the regex that will "capture" part of your data.
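The "\d" in the regex will match any single digit. In JavaScript "\d" matches only [0-9]; in some other engines (for example .NET, or Python 3 on text strings) it also matches other Unicode digits, such as the Eastern Arabic numerals "٠١٢٣٤٥٦٧٨٩".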
The "+" following the "\d" says to match any 1 or more numbers (digits).
The "[,.]" in the regex will match one of either a literal "." (dot), or a "," (comma), to match the "decimal" separator.
Another "\d+" to match any 1 or more numbers (digits).
Next, the ")" in the regex closes the part of the regex that will "capture" part of your data.
The "[^\d]" will match any 1 character that is not a number (digit). So, in this case, it will match the " " (space).
The "m" at the end of the regex (following the second "/"): "m" changes the behavior of the "^" and "$" anchors, which are not used in your regex, so the "m" should have no effect. But, if you're using Ruby, "m" changes the behavior of the "." (dot).
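Now, the "problem"... the ".*" (before the "(") is, in regex terms, "greedy". This means it will match as "early" as possible, and for as "long" as possible, giving characters back only as far as needed for the rest of the pattern to match. Since "\d+" is satisfied by a single digit, if there is more than 1 digit following the ";", then the ".*" will consume all but the last of them.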
Note: Using ".*" can cause all sorts of problems, especially with "/m" under Ruby. It's best to avoid using ".*" if possible.
There are 2 ways to fix this.
1) If the part before the number you want to capture is always "&#36;", then specify that in the regex instead of the ".*". So like this:
/\s*&#36;(\d+[,.]\d+)[^\d]/m
or, if it will always be "&#36;" or something very similar to that:
/\s*[^;]+;(\d+[,.]\d+)[^\d]/m
Here, "[^;]+;" means any string of 1 or more characters that does not contain a ";", followed by a ";".
2) If the part before the number you want to capture, which is shown as "&#36;", could be totally different in the data, then you just need to make sure that the part of the regex that is currently ".*" will not match a digit in the last position. So like this:
/\s[^.,]*[^\d](\d+[,.]\d+)[^\d]/m
Here, "[^.,]*[^\d]" means any string of 0 or more characters that does not contain a "." (dot) or a "," (comma), followed by one character that is not a digit.
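As a rough illustration of the greedy ".*" problem and the two fixes above, here is a minimal Python sketch. It assumes the input contains the literal "&#36;" entity, and it keeps the trailing "[^\d]*" from the question so the whole string is consumed; the names SAMPLES, GREEDY, MARKER, NON_DIGIT and extract are made up for this example.

```python
import re

# Sample strings from the question, containing the literal "&#36;" entity.
SAMPLES = [
    "\r\n\t\t\t\t&#36;0.05 USD\t\t\t",
    "\r\n\t\t\t\t&#36;10.05 USD\t\t\t",
]

# Original pattern: the greedy ".*" leaves only one digit for the first "\d+".
GREEDY = re.compile(r"\s*.*(\d+[,.]\d+)[^\d]*")
# Fix 1: use the ";" that ends the entity as a marker.
MARKER = re.compile(r"\s*[^;]+;(\d+[,.]\d+)[^\d]*")
# Fix 2: force the character just before the number to be a non-digit.
NON_DIGIT = re.compile(r"\s[^.,]*[^\d](\d+[,.]\d+)[^\d]*")


def extract(pattern, text):
    """Replace the matched text with capture group 1 (the number)."""
    return pattern.sub(r"\1", text)


for sample in SAMPLES:
    print(extract(GREEDY, sample), extract(MARKER, sample), extract(NON_DIGIT, sample))
```

With the second sample, the greedy pattern should give 0.05 while both fixes give 10.05.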

Try this
value.replace( /\s*.(\d+[,.]\d+)[^\d]/m, "$1");

The .* matches greedily and therefore matches as many characters, including digits, as it can, as long as the rest of the pattern can still match.
The rest of the pattern can still match if just one digit is left for the \d+ to match, so you only end up with one digit there.
If the semicolon in your example is always in that position in the strings you wish to match, use it as a marker like this
value.replace(/\s*.*;(\d+[,.]\d+).*/m, "$1");

Related

Regex for "replace all after first 2 digits after comma"

I would like to replace all characters after the first 2 digits after a comma.
E.g. having a string of 1234,56789 should result into 1234,56.
Using [^,]*$ put me on the right path, but it deletes everything after the comma.
[^,]..$ doesn't give me the correct result either, so I need a way to tell my expression that everything after "the first 2 digits after the comma" has to be deleted, rather than "the last 2 digits", since that's what the ".." seems to match in my expression.
You can use
(,\d{2}).*
The regex matches and captures into Group 1 a comma and two digits, and just matches the rest of the line with .*.
To remove only after last comma:
(.*,\d{2}).*
Here, .* at the start captures also everything at the start of the string.
A more restrictive pattern will be
^(\d+,\d{2})\d*$
It matches start of string (with ^), then one or more digits (with \d+), a comma, two digits, all captured into Group 1, and then just matches zero or more digits (with \d*) at the end of the string ($).
Replace with $1 (or \1 depending on the regex engine).
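To compare the three patterns side by side, here is a small Python sketch. The function names are invented for illustration, and Python's \1 stands in for $1.

```python
import re


def keep_two_decimals(text):
    # Keep the first comma plus two digits, drop the rest of the line.
    return re.sub(r"(,\d{2}).*", r"\1", text)


def keep_two_after_last_comma(text):
    # The leading greedy .* pushes the match to the last comma
    # that is followed by two digits.
    return re.sub(r"(.*,\d{2}).*", r"\1", text)


def keep_two_strict(text):
    # Only rewrite strings that are digits, a comma, then 2+ digits.
    return re.sub(r"^(\d+,\d{2})\d*$", r"\1", text)


print(keep_two_decimals("1234,56789"))
print(keep_two_after_last_comma("1,234,56789"))
print(keep_two_strict("1234,56789"))
```

The difference between the first two only shows up when the string contains more than one comma; the strict version leaves non-matching input such as "1234,5" unchanged.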
You can use:
import re
a = "1234,56789"  # sample input from the question
print(re.sub(r',(\d{2}).*', r',\1', a))  # prints 1234,56

Regular expressions in notepad++ (Search and Replace)

I have a list of thousands of records within a .txt document.
some of them look like these records
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record similar to this occurs I'd like to get rid of the last words/numbers/etc in the last quotation marks (like "Senior Champion", "Junior Champion" etc. There are many possibilities here).
e.g. (before)
201910031044 "00059" "11.31AG" "Senior Champion"
after
201910031044 "00059" "11.31AG"
I tried the following regex but it wouldn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK, I forgot the . (dot) in the character class; however, even for records without a . (dot) it does not work. I'm not sure whether it has anything to do with using the + sign more than once.
I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
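Outside Notepad++, the same idea can be sketched in Python. Python's re module does not support \K or \h, so this version assumes a capture group in place of \K and [ \t] in place of \h; RECORDS and drop_last_quoted_field are names made up for the example.

```python
import re

# Two of the sample records from the question.
RECORDS = (
    '201910031044 "00059" "11.31AG" "Senior Champion"\n'
    '201910031044 "00999" "10.12G" "ProAM"'
)


def drop_last_quoted_field(text):
    # (.+)     everything we want to keep (plays the role of ^.+\K)
    # [ \t]+   1 or more horizontal spaces (plays the role of \h+)
    # ".*?"$   the last quoted field, up to the end of the line
    return re.sub(r'^(.+)[ \t]+".*?"$', r"\1", text, flags=re.MULTILINE)


print(drop_last_quoted_field(RECORDS))
```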
The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so probably easier opening in excel or parsing using code as a CSV.
You have a problem with the count of your characters:
you specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
you can specify exactly 12 digits by using {12} or if it could be 12-17, then {12,17}. I'll assume exactly 12 based on the current data.
similarly, for the second column you specify that it's exactly 8 digits surrounded by quotes ("[0-9]{8,8}") but it only has 5 digits surrounded by quotes.
again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
finally, there is no quantifier for the final field, so the regex tries to match exactly one character that is a letter or a number surrounded by quotes "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more using + as quantifier "[a-zA-Z0-9]+" - if you can have zero or more, then you can use *, or if it's any other count from m to n, then you can use {m,n} as before.
Not a character count problem, but the final column can also contain dots, which the regex doesn't account for. You can just add . inside the square brackets and it will match only literal dot characters. It's usually a wildcard, but it loses its special meaning inside a character class ([]), so you get "[a-zA-Z0-9.]+"
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ for whitespace instead of +. As a benefit, \s will also match other whitespace like tabs, so you don't have to manually account for those. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
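For anyone who would rather script this than use the Replace dialog, here is a hedged Python sketch of the shortened pattern, run against the four sample records. LINE_PATTERN, SAMPLE_RECORDS and keep_first_three_fields are hypothetical names; re.MULTILINE is assumed so that ^ and $ work per line.

```python
import re

# The shortened search pattern from above, applied line by line.
LINE_PATTERN = re.compile(
    r'^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$', re.MULTILINE
)

SAMPLE_RECORDS = "\n".join([
    '201910031044 "00059" "11.31AG" "Senior Champion"',
    '201910031044 "00060" "GBA146" "Junior Champion"',
    '201910031044 "00999" "10.12G" "ProAM"',
    '201910031044 "00362" "113.1LI" "Abcd"',
])


def keep_first_three_fields(text):
    # Replace each matching line with capture group 1 only.
    return LINE_PATTERN.sub(r"\1", text)


print(keep_first_three_fields(SAMPLE_RECORDS))
```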
If you want to get rid of the last words/numbers/etc in the last quotation marks you could capture in a group what is before that and match the last quotation marks and everything between it to remove it using a negated character class.
If what is between the values can be spaces or tabs, you could use [ \t]+ to match those (using \s could also match a newline)
Note that {17,17} and {8,8} may also be written as {17} and {8} which in this case should be {12} and {5}
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
[ \t]{2,} Match 2+ spaces or tabs
"[^"\r\n]+" Match 1+ characters other than " or a newline, between "
In the replacement use group 1 $1

Regex not returning all matches

I have the following regex (my actual regex is actually a lot more complex but I pinned down my problem to this): \s(?<number>123|456)\s
And the following test data:
" 123 456 "
As expected/wanted result I would have the regex match in 2 matches one with "number" being "123" and the second with number being "456". However, I'm only getting 1 match with "number" being "123".
I did notice that adding another space in between "123" and "456" in the test data does give 2 matches...
Why don't I get the result I want? How to get it right?
Your pattern contains consuming \s patterns that match a whitespace before and after a number, and the input contains consecutive numbers separated by a single whitespace. The first match consumes that whitespace, so it is no longer available as the leading \s of the second match. If there were two spaces between the numbers, it would work.
Use whitespace boundaries based on lookarounds:
(?<!\S)(?<number>123|456)(?!\S)
The (?<!\S) is a negative lookbehind that will fail the match if there is a non-whitespace char immediately to the left of the current location, and (?!\S) is a negative lookahead that will fail the match if there is a non-whitespace char immediately to the right of the current location.
(?<!\S) is the same as (?<=^|\s) and (?!\S) is the same as (?=$|\s), but more efficient.
Note that in many situations you might even go with 1 lookahead and use
\s(?<number>123|456)(?!\S)
It will ensure the consecutive whitespace separated matches are found.
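The difference between the consuming and the lookaround versions can be checked with a short Python sketch. Note the assumptions: Python spells named groups as (?P<number>...) rather than (?<number>...), and TEXT and numbers are names invented for this example.

```python
import re

TEXT = " 123 456 "


def numbers(pattern):
    """Return the "number" group of every match of pattern in TEXT."""
    return [m.group("number") for m in re.finditer(pattern, TEXT)]


# Consuming whitespace on both sides: the space between the numbers
# is used up by the first match.
consuming = numbers(r"\s(?P<number>123|456)\s")
# Lookarounds check the neighbours without consuming them.
lookarounds = numbers(r"(?<!\S)(?P<number>123|456)(?!\S)")
# Consuming only the leading whitespace is enough here.
one_lookahead = numbers(r"\s(?P<number>123|456)(?!\S)")

print(consuming, lookarounds, one_lookahead)
```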

Regex expression for digit followed by dot (.)

I want to find text consisting of a digit followed by a dot and replace it with the same text (digit with dot) plus the string "xyz".
For ex.
1. This is a sample
2. test
3. string
**I want to change it to**
1.xyz This is a sample
2.xyz test
3.xyz string
I learnt how to find the matching text (\d.) but the challenge is to find the replace with text.
I'm using notepad ++ editor for this, can anyone suggest the "Replace with" string.
First of all, you need to escape the dot since it means "match anything (except newline depending if the s modifier is set)": (\d\.).
Second, you need to add a quantifier in case you have a 2 digit number or more: (\d+\.).
Third, we don't need group 1 in this case, so the parentheses can be dropped, leaving \d+\. as the search pattern.
In the replacement, it's quite simple: just use $0xyz. $0 will refer to group 0 which is the whole match.
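The same search and replace can be tried outside the editor. This is only a sketch in Python, where the whole match is written \g<0> instead of $0; add_suffix is a hypothetical helper name.

```python
import re


def add_suffix(text):
    # \d+\.   one or more digits followed by a literal dot
    # \g<0>   the whole match, i.e. Python's spelling of $0
    return re.sub(r"\d+\.", r"\g<0>xyz", text)


print(add_suffix("1. This is a sample\n2. test\n3. string"))
```

Because the pattern is not anchored, it would also fire on a number and dot in the middle of a line; prefixing it with ^ (and using re.MULTILINE) would restrict it to list markers at the start of a line.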
For notepad++...
You must escape the period/dot character in the expression - precede it with a backslash:
\.
In my case, I needed to find all instances of "{EnvironmentName}.api.mycompany.com"
(dev.api.mycompany.com, stage.api.mycompany.com, prod.api.mycompany.com, etc.)
I used this search expression:
.*\.api\.mycompany\.com
I think the right answer is as follows:
Find: ^(\d)([.])(\s)
Replace: $1$2XYZ$3
That will work with "n. ", where "n" is a single digit [0-9]. If the input should accept numbers of different lengths like 10, 100, 1000..., multiple dots "." after the number, or multiple spaces after the dot, then the answer is:
Find: ^(\d+)([.])([.]*)(\s*)
Replace: $1$2XYZ$4
Input:
1. This is a sample
2. test
3. string
30. string
10..... string
50005... string
Output:
1.XYZ This is a sample
2.XYZ test
3.XYZ string
30.XYZ string
10.XYZ string
50005.XYZ string

regex: find one-digit number

I need to find all the one-digit numbers in a text.
My code:
$string = 'text 4 78 text 558 my.name#gmail.com 5 text 78998 text';
$pattern = '/ [\d]{1} /';
(result: 4 and 5)
Everything works perfectly, I just wanted to ask: is it correct to use spaces?
Maybe there is some other way to distinguish one-digit number.
Thanks
First of all, [\d]{1} is equivalent to \d.
As for your question, it would be better to use a zero width assertion like a lookbehind/lookahead or word boundary (\b). Otherwise you will not match consecutive single digits because the leading space of the second digit will be matched as the trailing space of the first digit (and overlapping matches won't be found).
Here is how I would write this:
(?<!\S)\d(?!\S)
This means "match a digit only if there is not a non-whitespace character before it, and there is not a non-whitespace character after it".
I used the double negative like (?!\S) instead of (?=\s) so that you will also match single digits that are at the beginning or end of the string.
I prefer this over \b\d\b for your example because it looks like you really only want to match when the digit is surrounded by spaces, and \b\d\b would match the 4 and the 5 in a string like 192.168.4.5
To allow punctuation at the end, you could use the following:
(?<!\S)\d(?![^\s.,?!])
Add any additional punctuation characters that you want to allow after the digit to the character class (inside of the square brackets, but make sure it is after the ^).
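The points above can be checked with a short Python sketch (the question uses PHP, but these patterns behave the same way in Python's re module). SAMPLE is the string from the question; single_digits is a name made up for this example.

```python
import re

SAMPLE = "text 4 78 text 558 my.name#gmail.com 5 text 78998 text"


def single_digits(text, pattern=r"(?<!\S)\d(?!\S)"):
    """Find single digits, by default using whitespace lookarounds."""
    return re.findall(pattern, text)


print(single_digits(SAMPLE))
# Consuming spaces skips every other digit in a run of single digits.
print(re.findall(r" \d ", " 1 2 3 "))
print(single_digits(" 1 2 3 "))
# Word boundaries also match digits inside an IP address.
print(single_digits("192.168.4.5", r"\b\d\b"))
print(single_digits("192.168.4.5"))
# Variant that tolerates trailing punctuation.
print(single_digits("Take 5, then 7.", r"(?<!\S)\d(?![^\s.,?!])"))
```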
Use word boundaries. Note that the range quantifier {1} and the character class [] are both redundant: a single \d already matches exactly one digit, and the class consists of only one element.
\b\d\b
Search around word boundaries:
\b\d\b
As explained by the others, this will extract single digits meaning that some special characters might not be respected like "." in an ip address. To address that, see F.J and Mike Brant's answer(s).
It really depends on where the numbers can appear and whether you care if they are adjacent to other characters (like . at the end of a sentence). At the very least, I would use word boundaries so that you can get numbers at the beginning and end of the input string:
$pattern = '/\b\d\b/';
But you might consider punctuation at the end like:
$pattern = '/\b\d(\b|\.|\?|\!)/';
If one-digit numbers can be preceded or followed by characters other than digits (e.g., "a1 cat" or "Call agent 7, pronto!") use
(?<!\d)\d(?!\d)
The regular expression reads: match a digit (\d) that is neither preceded nor followed by a digit, (?<!\d) being a negative lookbehind and (?!\d) being a negative lookahead.
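A quick Python sketch of this last pattern, using the two example strings above plus two extra inputs chosen for illustration; isolated_digits is a hypothetical name.

```python
import re


def isolated_digits(text):
    # A digit that has no digit directly before or after it.
    return re.findall(r"(?<!\d)\d(?!\d)", text)


print(isolated_digits("a1 cat"))
print(isolated_digits("Call agent 7, pronto!"))
print(isolated_digits("room 42, floor 7"))
print(isolated_digits("192.168.4.5"))
```

Like \b\d\b, this version still picks up the 4 and the 5 in an IP address, so the whitespace lookarounds remain the stricter choice when that matters.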