Regex to match a list of numbers

I'm trying to write a regex to match a very long list of numbers separated by commas and an optional space. It can't match a single integer. The list of numbers is approx 7000 bytes long bounded by text on either side.
12345 => don't match
12345,23456,34567,45678 => match
12345, 23456, 34567, 45678 => match
My current regex,
(?<!\.)(([0-9]+,)+[0-9]+)(?!\.)
causes a stack overflow. A few I have tried so far are:
([0-9,]+) => doesn't match with optional spaces
((\d+,[ ]?)+\d+) => worse than the original
[ ]([0-9, ]+)[ ] => can't be certain the numbers will be bounded by spaces
I'm using https://regex101.com/ to test the number of steps each regex takes; the original takes approx 3000 steps.
Example (elided) string:
Processing 145363,145386,145395,145422,145463,145486 from batch 59
Any help would be appreciated.

You can use this regex:
^\d+(?:[ \t]*,[ \t]*\d+)+$
\d+ matches 1 or more digits
(?:...)+ matches one or more further numbers, each preceded by a comma that may be surrounded by spaces or tabs.
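As a quick check of the anchored pattern, here is a minimal sketch in Python (the `re` module and the helper names are my own choice, not something from the question). Because of `^` and `$`, the pattern only matches when the list is the entire string. The question says the list is bounded by text on either side, so the sketch also tries the same pattern with the anchors dropped.

```python
import re

# The pattern from the answer: the whole string must be the list.
ANCHORED = re.compile(r"^\d+(?:[ \t]*,[ \t]*\d+)+$")
# Same pattern without anchors, for a list embedded in other text.
EMBEDDED = re.compile(r"\d+(?:[ \t]*,[ \t]*\d+)+")


def is_number_list(text):
    """True if the entire string is a list of two or more numbers."""
    return ANCHORED.search(text) is not None


def find_number_list(text):
    """Return the first embedded list of two or more numbers, or None."""
    m = EMBEDDED.search(text)
    return m.group(0) if m else None


sample = "Processing 145363,145386,145395,145422,145463,145486 from batch 59"
print(is_number_list("12345"))                       # False: single integer
print(is_number_list("12345, 23456, 34567, 45678"))  # True
print(is_number_list(sample))                        # False: text around the list
print(find_number_list(sample))                      # the list on its own
```

Whether to keep the anchors therefore depends on whether the list has already been isolated from the surrounding text.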

(\d+,\s*)+\d+
\d+,\s* matches each number followed by a comma and optional whitespace. The last number has no trailing comma, so that group does not match it; the final \d+ takes care of it.

How about
(?:\d+,\s*)+\d+
Breakdown:
(?: # begin group
\d+ # digits
,\s* # ",", optional whitespace
)+ # end group, repeat
\d+ # digits (last item in the list)
Note that \s includes whitespace characters besides space and tab, most notably line breaks (\n). Use [ \t] in place of \s to prevent false positives, if your input requires it.
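To illustrate the note about \s, here is a small Python sketch. The input string is made up for the demonstration; it puts a line break right after a comma.

```python
import re

WITH_S = re.compile(r"(?:\d+,\s*)+\d+")           # \s also matches \n, \r, \f, \v
WITH_BLANKS = re.compile(r"(?:\d+,[ \t]*)+\d+")   # only space and tab

text = "ids: 12,\n34 end"  # hypothetical input with a newline after the comma

m_s = WITH_S.search(text)
m_blanks = WITH_BLANKS.search(text)

print(repr(m_s.group(0)) if m_s else None)            # '12,\n34'
print(repr(m_blanks.group(0)) if m_blanks else None)  # None
```

With \s the two numbers on different lines are joined into one list; with [ \t] they are not.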

Related

Regex to match numbers, commas, dots, spaces in a given string

I'm writing a regex to match any numbers, commas, dots, except when they are at the end of the number.
Here is an example of what I have so far:
/([0-9]+[., ]*)+/
This is pretty good already because it matches what I want. The only issue is that it also matches a trailing ' ', ',' or '.' at the end of the expression.
Let's say I have this string:
The cost of the food was 1 999,49 € without drinks.
I want to match the 1 999,49 string. Right now my regexp is matching 1 999,49 . The same should happen if the format of the price is different like:
1,999.49 $ => 1,999.49 (with no whitespace or anything in the end)
How can I do this with regular expressions?
You could first match the digits and then optionally match a space, comma or dot followed by 1+ digits, so that the dot, comma or space cannot be at the end.
\d+(?:[,. ]\d+)*
\d+ Match 1+ digits
(?: Non capture group
[,. ]\d+ Match either a space , or . and 1+ digits
)* Close group and repeat 0+ times
A bit more precise match could be
\b\d{1,3}(?:[,. ]\d{3})*(?:[.,]\d{2})?\b
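Both patterns can be tried against the two price formats from the question. This is only a sketch in Python; the variable names are mine.

```python
import re

SIMPLE = re.compile(r"\d+(?:[,. ]\d+)*")
PRECISE = re.compile(r"\b\d{1,3}(?:[,. ]\d{3})*(?:[.,]\d{2})?\b")

samples = [
    "The cost of the food was 1 999,49 € without drinks.",
    "1,999.49 $",
]

# Collect what each pattern extracts from each sample.
simple_hits = [SIMPLE.search(s).group(0) for s in samples]
precise_hits = [PRECISE.search(s).group(0) for s in samples]

print(simple_hits)   # ['1 999,49', '1,999.49']
print(precise_hits)  # ['1 999,49', '1,999.49']
```

On these inputs both patterns return the price without the trailing space. The second one is stricter about group sizes (three digits per thousands group, two decimals).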

Regex match an optional number of digits

I have a list that could look sort of like
("!Goal 27' Edward Nketiah"),
("!Goal 33' 46' Pierre Emerick-Aubameyang"),
("!Sub Nicolas Pepe"),
("Jordan Pickford"),
and I'm looking to match either !Sub or !Goal 33' 46' or !Goal 27'
Right now I'm using the regex (!\w+\s) which will match !Goal and !Sub, but I want to be able to get the timestamps too. Is there an easy way to do that? There is no limit on the number of timestamps there could be.
As I mentioned in my comment, you can use the following regex to accomplish this:
(!\w+(?:\s\d+')*)
Explanation:
(!\w+(?:\s\d+')*) capture the following
! matches this character literally
\w+ matches one or more word characters
(?:\s\d+')* match the following non-capture group zero or more times
\s match a whitespace character
\d+ matches one or more digits
' match this character literally
Additionally, the first capture group isn't necessary - you can remove it to simply match:
!\w+(?:\s\d+')*
If you need each timestamp, you can use !\w+((?:\s\d+')*) and split capture group 1 on the space character. (With !\w+(\s\d+')* most engines keep only the last repetition in group 1.)
If your input always follows the format "bang text blank digits apostrophe blank digits apostrophe etc", then it should be as simple as:
!\w+(?:\s\d+')*
Explanation:
! matches an exclamation mark
\w+ matches 1 or more word characters (letters, digits, underscores)
(?:…) is a non-capturing group
\s matches a single whitespace character
\d+ matches one or more digits
' matches the apostrophe character
* repeatedly matches the group 0 or more times
This:
(!\w+(?:\s\d+')*)
will capture:
"!Goal 27'"
"!Goal 33' 46'"
"!Sub"
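A minimal Python sketch of the same idea, run over the entries from the question. It assumes the timestamps should come back as a list, so the repeated group is wrapped in one capturing group and that capture is split on whitespace. The function name `parse` is hypothetical.

```python
import re

# Group 1 holds every " <digits>'" repetition as one string.
TAG = re.compile(r"!\w+((?:\s\d+')*)")

entries = [
    "!Goal 27' Edward Nketiah",
    "!Goal 33' 46' Pierre Emerick-Aubameyang",
    "!Sub Nicolas Pepe",
    "Jordan Pickford",
]


def parse(entry):
    """Return (whole tag, list of timestamps), or None if there is no tag."""
    m = TAG.match(entry)
    if m is None:
        return None
    return m.group(0), m.group(1).split()


for entry in entries:
    print(parse(entry))
```

The second entry yields the tag `!Goal 33' 46'` with the timestamps `33'` and `46'`; the last entry has no tag and yields `None`.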

perl regex - catching four numbers separated by space

I know that we can use regex in perl to catch numbers using [\d], but my pattern is like this:
261 193 546 302
or it could be like this:
16 0 98 120
The point is - I just want to catch a line that has any four numbers separated by a space. Each number can be made up of any number of digits, it could be a single-digit number, or a double-digit number, and so on.
^\d+(?:\s+\d+){3}$
Try this. This should do it for you.
You don't explicitly have to wrap the token inside of a character class. And for this you want to assert the start of the string and end of the string positions, so I would use anchors and quantify a non-capturing group "3" times.
^\d+(?: \d+){3}$
Explanation:
^ # the beginning of the string
\d+ # digits (0-9) (1 or more times)
(?: # group, but do not capture (3 times):
# ' '
\d+ # digits (0-9) (1 or more times)
){3} # end of grouping
$ # before an optional \n, and the end of the string
Based on your requirement to "catch a line that has any four numbers separated by a space", I would use the following, as it contains a capture group that will hold your number sequence and will ignore any leading or trailing spaces.
((?:\d+\s){3}\d+)
Usage in Perl
my $re = qr/((?:\d+\s){3}\d+)/;
As you can see, it matches four numbers separated by single whitespace characters and ignores preceding and trailing characters.
Alternate
If you were being explicit and actually want to capture the whole line, including any other characters, this will be better suited.
(^.*(?:\d+\s){3}\d+.*$)
Usage In Perl
my $re = qr/(^.*(?:\d+\s){3}\d+.*$)/mx;
Note that because of the leading and trailing .*, this also matches lines whose numbers have decimal places.
Try ^\d+\s\d+\s\d+\s\d+$. That will match 4 numbers with spaces and nothing else.
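For comparison, here is how the anchored "four numbers, single spaces" check might look in Python. This is a sketch, not a translation of any of the Perl answers: `re.fullmatch` stands in for the `^` and `$` anchors, and the function name is made up.

```python
import re

FOUR_NUMBERS = re.compile(r"\d+(?: \d+){3}")


def has_four_numbers(line):
    """True if the line is exactly four integers separated by single spaces."""
    # fullmatch requires the pattern to cover the whole line.
    return FOUR_NUMBERS.fullmatch(line) is not None


for line in ["261 193 546 302", "16 0 98 120", "1 2 3", "1 2 3 4 5", "1.5 2 3 4"]:
    print(line, "=>", has_four_numbers(line))
```

Only the first two lines pass; three numbers, five numbers, and a decimal number are all rejected.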

Regular expression captures unwanted string

I have created the following expression: (.NET regex engine)
((-|\+)?\w+(\^\.?\d+)?)
hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121
It works well except that :
it captures the term 'hello+' but without the '+' : this group should not be captured at all
the last term 'hello^-1212121' as 2 groups 'hello' and '-1212121' both should be ignored
The strings to capture are as follows :
word can have a + or a - before it
or word can have a ^ that is followed by a positive number (not necessarily an integer)
words are separated by commas and any number of white spaces (both not part of the capture)
A few examples of valid strings to capture :
hello^2
hello^.2
+hello
-hello
hello
EDIT
I have found the following expression which effectively captures all these terms, it's not really optimized but it just works :
([a-zA-Z]+(?= ?,))|((-|\+)[a-zA-Z]+(?=,))|([a-zA-Z]+\^\.?\d+)
Ok, there are some issues to tackle here:
((-|+)?\w+(\^.?\d+)?)
    ^        ^
The + and . should be escaped like this:
((-|\+)?\w+(\^\.?\d+)?)
Now, you'll also get -1212121 there. If your string hello is always letters, then you would change \w to [a-zA-Z]:
((-|\+)?[a-zA-Z]+(\^\.?\d+)?)
\w includes letters, numbers and underscore. So, you might want to restrict it down a bit to only letters.
And finally, to keep the delimiters out of the capture entirely, you'll have to use lookarounds. I don't know of any other way to get to the delimiters without hindering the matches:
(?<=^|,)\s*((-|\+)?[a-zA-Z]+(\^\.?\d+)?)\s*(?=,|$)
EDIT: If it cannot be something like -hello^2, and if another valid string is hello^9.8, then this one will fit better:
(?<=^|,)\s*((?:-|\+)?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)(?=\s*(?:,|$))
And lastly, if capturing the words is sufficient, we can remove the lookarounds:
([-+]?[a-zA-Z]+|[a-zA-Z]+\^(?:\d+)?\.?\d+)
It would be better if you first state what it is you are looking to extract.
You also don't indicate which Regular Expression engine you're using, which is important since they vary in their features, but...
Assuming you want to capture only:
words that have a leading + or -
words that have a trailing ^ followed by an optional period followed by one or more digits
and that words are sequences of one or more letters
I'd use:
([a-zA-Z]+\^\.?\d+|[-+][a-zA-Z]+)
which breaks down into:
( # start capture group
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
\^ # literal
\.? # optional period
\d+ # one or more digits
| # OR
[-+] # literal plus or minus
[a-zA-Z]+ # one or more letters
) # end of capture group
EDIT
To also capture plain words (without leading or trailing chars) you'll need to rearrange the regexp a little. I'd use:
([+-][a-zA-Z]+|[a-zA-Z]+\^(?:\.\d+|\d+\.\d+|\d+)|[a-zA-Z]+)
which breaks down into:
( # start capture group
[+-] # literal plus or minus
[a-zA-Z]+ # one or more letters - note \w matches numbers and underscores
| # OR
[a-zA-Z]+ # one or more letters
\^ # literal
(?: # start of non-capturing group
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
\. # literal period
\d+ # one or more digits
| # OR
\d+ # one or more digits
) # end of non-capturing group
| # OR
[a-zA-Z]+ # one or more letters
) # end of capture group
Also note that, per your updated requirements, this regexp captures both ordinary non-negative numbers (e.g. 0, 1, 1.2, 1.23) and those lacking a leading digit (e.g. .1, .12).
FURTHER EDIT
This regexp will only match the following patterns delimited by commas:
word
word with leading plus or minus
word with trailing ^ followed by a positive number of the form \d+, \d+\.\d+, or \.\d+
([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)
Please note that the useful match will appear in the first capture group, not the entire match.
So, in JavaScript, you'd write:
var src = "hello , hello ,hello,+hello,-hello,hello+,hello-,hello^1,hello^1.0,hello^.1",
    RE = /([+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+)(?=,|\s|$)/g,
    m;
// exec() with the g flag resumes after the previous match on each call
while ((m = RE.exec(src)) !== null) {
  console.log(m[1]);
}
which produces:
hello
hello
hello
+hello
-hello
hello^1
hello^1.0
hello^.1
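The same term grammar can be checked in Python, with one change of technique that should be stated plainly: Python's `re` module rejects a lookbehind such as `(?<=^|,)` because its alternatives differ in width, so this sketch splits on commas and validates each trimmed term with `fullmatch` instead of using lookarounds. The helper name `valid_terms` is hypothetical.

```python
import re

# Signed word | word^number | plain word, taken from the last pattern above.
TERM = re.compile(r"[+-][A-Za-z]+|[A-Za-z]+\^(?:\.\d+|\d+(?:\.\d+)?)|[A-Za-z]+")


def valid_terms(text):
    """Split on commas, trim whitespace, keep terms that match completely."""
    terms = (t.strip() for t in text.split(","))
    return [t for t in terms if TERM.fullmatch(t)]


src = "hello , hello^.555,hello^111, -hello,+hello, hello+, hello^.25, hello^-1212121"
print(valid_terms(src))
# ['hello', 'hello^.555', 'hello^111', '-hello', '+hello', 'hello^.25']
```

Because each term must match in full, `hello+` and `hello^-1212121` are dropped as a whole rather than partially captured, which is the behaviour the question asks for.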

Regex to find repeating numbers

Can anyone help me or direct me to build a regex to validate repeating numbers
eg : 11111111, 2222, 99999999999, etc
It should validate for any length.
\b(\d)\1+\b
Explanation:
\b # match word boundary
(\d) # match digit remember it
\1+ # match one or more instances of the previously matched digit
\b # match word boundary
If 1 should also be a valid match (zero repetitions), use a * instead of the +.
If you also want to allow longer repeats (123123123) use
\b(\d+)\1+\b
If the regex should be applied to the entire string (as opposed to finding "repeat-numbers" in a longer string), use start- and end-of-line anchors instead of \b:
^(\d)\1+$
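The three variants can be compared side by side with a short Python sketch. The test strings beyond the question's own examples are made up.

```python
import re

SAME_DIGIT = re.compile(r"\b(\d)\1+\b")       # one digit repeated
REPEATED_BLOCK = re.compile(r"\b(\d+)\1+\b")  # a block of digits repeated
WHOLE_STRING = re.compile(r"^(\d)\1+$")       # entire string is one repeated digit

text = "11111111, 2222, 99999999999, 1234, 5"

# Use finditer and group(0): findall would return only the captured digit.
same = [m.group(0) for m in SAME_DIGIT.finditer(text)]
blocks = [m.group(0) for m in REPEATED_BLOCK.finditer("123123123 4545 678")]

print(same)    # ['11111111', '2222', '99999999999']
print(blocks)  # ['123123123', '4545']
print(bool(WHOLE_STRING.search("2222")), bool(WHOLE_STRING.search("id 2222")))  # True False
```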
Edit: How to match the exact opposite, i. e. a number where not all digits are the same (except if the entire number is simply a digit):
^(\d)(?!\1+$)\d*$
^ # Start of string
(\d) # Match a digit
(?! # Assert that the following doesn't match:
\1+ # one or more repetitions of the previously matched digit
$ # until the end of the string
) # End of lookahead assertion
\d* # Match zero or more digits
$ # until the end of the string
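A quick Python check of the negated pattern, using a few made-up inputs:

```python
import re

NOT_ALL_SAME = re.compile(r"^(\d)(?!\1+$)\d*$")


def not_all_same(s):
    """True for digit strings that are not one digit repeated two or more times."""
    return NOT_ALL_SAME.match(s) is not None


for s in ["1123", "1211", "7", "22", "1111", "12a"]:
    print(s, "=>", not_all_same(s))
```

`1123`, `1211` and the single digit `7` are accepted; `22` and `1111` are rejected by the lookahead, and `12a` is rejected because it is not all digits.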
To match a number of repetitions of a single digit, you can write ([0-9])\1*.
This matches [0-9] into a group, then matches 0 or more repetitions (\1) of what that group matched.
You can write \1+ to match one or more repetitions.
Use a backreference:
(\d)\1+
Probably you want to use some sort of anchors ^(\d)\1+$ or \b(\d)\1+\b
I used this expression to give me all phone numbers that are all the same digit.
Basically, it asks for 9 repetitions of the first captured digit, which gives 10 of the same digit in a row.
([0-9])\1{9}
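(\d)\1+? matches any digit followed by at least one repetition of itself. Because +? is lazy, the match stops after the first repetition (two digits) unless the rest of the pattern forces it to continue.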
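The difference between the greedy and lazy quantifier, and the ten-identical-digits check, can be seen in a small Python sketch (the sample strings are made up):

```python
import re

text = "1111"
greedy = re.search(r"(\d)\1+", text).group(0)   # takes as many repeats as possible
lazy = re.search(r"(\d)\1+?", text).group(0)    # stops after the first repeat

print(greedy)  # 1111
print(lazy)    # 11

# One captured digit plus 9 repeats = 10 identical digits.
TEN_SAME = re.compile(r"([0-9])\1{9}")
print(bool(TEN_SAME.fullmatch("5555555555")))  # True
print(bool(TEN_SAME.fullmatch("5555555556")))  # False
```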
You can find repeated text or numbers easily with a backreference.
In a pattern of the form ([inside pattern])\1, whatever text the group in parentheses matched, \1 then looks for that same text immediately after it.