Regular expression to find space-delimited numbers

I have a string that comes from user input through a messaging system. It can contain a series of 4-digit numbers, but because users are likely to mistype things, the matching needs to be a little flexible.
Therefore I want to let them type in the numbers, or pepper their message with any other characters, and then take only the numbers that match the formats
=nnnn or nnnn
For this I have the Regular Expression:
(^|=|\s)\d{4}(\s|$)
This almost works. However, because each group of 4 digits must be preceded by an =, a space, or the start of the string, it misses every other set of numbers.
I tried this:
(^|=|\s*)\d{4}(\s|$)
But that means that any four digits followed by a space get matched - which is incorrect.
How can I match groups of numbers when a single space is both the end of one group and the beginning of the next? To clarify, this string:
Ack 9876 3456 3467 4578 4567
Should produce the matches:
9876
3456
3467
4578
4567

Here you need lookarounds, which do not consume any characters.
(?:^|[=\s])\K\d{4}(?=\s|$)
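OR, if your engine does not support \K (it is available in PCRE and Perl, among others), capture the digits in group 1: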
(?:^|[=\s])(\d{4})(?=\s|$)
Your regex (^|=|\s)\d{4}(\s|$) fails because the first match is <space>9876<space>, and the search for the next match resumes after that trailing space. The engine again needs a space, an equals sign, or the start of the line, so the next match it finds is <space>3467<space>. It cannot match 3456 because the space before 3456 was already consumed by the first match. To let adjacent numbers share a delimiter, put the delimiter pattern inside a lookaround. When the trailing (\s|$) is placed inside a lookahead, it no longer consumes the space; it only asserts that the match is followed by a space or the end of the line.
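To see the difference concretely, here is a minimal sketch in Python (chosen only for illustration; the question does not name a language). Python's built-in re module does not support \K, so the sketch uses the capture-group variant, and the variable names are made up for the example.

```python
import re

text = "Ack 9876 3456 3467 4578 4567"

# Original idea: the trailing (\s|$) consumes the space that the
# next number would need as its leading delimiter.
consuming = re.compile(r"(?:^|=|\s)(\d{4})(?:\s|$)")

# Lookahead variant: the trailing delimiter is asserted, not consumed.
lookahead = re.compile(r"(?:^|[=\s])(\d{4})(?=\s|$)")

consumed_matches = consuming.findall(text)
lookahead_matches = lookahead.findall(text)

print(consumed_matches)   # every other number is skipped
print(lookahead_matches)  # all five numbers are found
```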

\b\d+\b
\b asserts position at a word boundary (^\w|\w$|\W\w|\w\W). It is a 0-width anchor, much like ^ and $. It doesn't consume any characters. Note that \d+ matches a number of any length; use \b\d{4}\b to require exactly four digits.
or
(?:^|(?<=[=\s]))\d{4}\b
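As a rough check of how these two patterns behave, the following Python sketch runs both against the question's sample and against a made-up string (`mixed`, an assumption for illustration) that contains numbers of other lengths.

```python
import re

text = "Ack 9876 3456 3467 4578 4567"
mixed = "=1234 56 78901 4321"  # hypothetical input with non-4-digit numbers

word_boundary = re.compile(r"\b\d+\b")
four_digits = re.compile(r"(?:^|(?<=[=\s]))\d{4}\b")

# Both patterns use zero-width assertions, so adjacent numbers
# can share a single space.
any_length = word_boundary.findall(text)
exactly_four = four_digits.findall(text)

# \d+ accepts any length; the second pattern insists on four digits
# preceded by '=', whitespace, or the start of the string.
any_length_mixed = word_boundary.findall(mixed)
exactly_four_mixed = four_digits.findall(mixed)

print(any_length, exactly_four)
print(any_length_mixed, exactly_four_mixed)
```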

Related

Regex how can i get only exact part in a string

I should only catch numbers that fit the rules.
Rules:
it should be 16 digits long
the first 11 digits can be any digits
the next 3 digits must all be zero
the last two digits can be any digits.
I did it this way:
([0-9]{11}[0]{3}[0-9]{2})
Example number:
1234567890100012
Now I want to get the number even if there are letters at the beginning or end of the string, like " abc1234567890100012abc".
My output should be just the number, like "1234567890100012".
When I add [a-zA-Z]* it returns the whole string.
Another point: if there are extra digits at the beginning or end of the string, like "999912345678901000129999", the program shouldn't accept it. I mean it should return None or nothing. How can I write this with a regex?
You can use lookarounds to exclude the cases where there are more digits before or after:
(?<!\d)\d{11}000\d\d(?!\d)
You can use a capture group, and match optional chars a-zA-Z before and after the group.
To prevent a partial match, you can use word boundaries \b. If the string should match from the start to the end of the line, you can use the anchors ^ and $ instead.
\b[a-zA-Z]*([0-9]{11}000[0-9]{2})[a-zA-Z]*\b
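Both suggestions can be tried on the two inputs from the question. This is a small Python sketch; the helper `extract` and the variable names are hypothetical and exist only for the example.

```python
import re

# Lookaround version: 11 digits, "000", 2 digits, no digit on either side.
strict = re.compile(r"(?<!\d)\d{11}000\d\d(?!\d)")

# Capture-group version: optional letters around the number,
# word boundaries on the outside.
grouped = re.compile(r"\b[a-zA-Z]*([0-9]{11}000[0-9]{2})[a-zA-Z]*\b")

def extract(pattern, text):
    """Return the 16-digit number found in text, or None."""
    m = pattern.search(text)
    if m is None:
        return None
    # Use group 1 when the pattern defines one, else the whole match.
    return m.group(1) if pattern.groups else m.group(0)

print(extract(strict, " abc1234567890100012abc"))
print(extract(grouped, " abc1234567890100012abc"))
print(extract(strict, "999912345678901000129999"))
print(extract(grouped, "999912345678901000129999"))
```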

Regex not returning all matches

I have the following regex (my actual regex is actually a lot more complex but I pinned down my problem to this): \s(?<number>123|456)\s
And the following test data:
" 123 456 "
The result I expect is 2 matches: one with "number" being "123" and the second with "number" being "456". However, I'm only getting 1 match, with "number" being "123".
I did notice that adding another space between "123" and "456" in the test data does give 2 matches...
Why don't I get the result I want? How to get it right?
Your pattern contains consuming \s patterns that match a whitespace character before and after a number, and the input contains consecutive numbers separated by a single whitespace character. If there were two spaces between the numbers, it would work.
Use whitespace boundaries based on lookarounds:
(?<!\S)(?<number>123|456)(?!\S)
The (?<!\S) is a negative lookbehind that will fail the match if there is a non-whitespace char immediately to the left of the current location, and (?!\S) is a negative lookahead that will fail the match if there is a non-whitespace char immediately to the right of the current location.
(?<!\S) is the same as (?<=^|\s) and (?!\S) is the same as (?=$|\s), but more efficient.
Note that in many situations you might even go with 1 lookahead and use
\s(?<number>123|456)(?!\S)
This still finds consecutive whitespace-separated matches, because the trailing whitespace is not consumed.
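The three variants can be compared side by side. The sketch below uses Python, which spells named groups as (?P<number>...) rather than (?<number>...); the helper `numbers` is a hypothetical name for the example.

```python
import re

data = " 123 456 "

# Python writes named groups as (?P<name>...); the logic is unchanged.
consuming = re.compile(r"\s(?P<number>123|456)\s")
boundaries = re.compile(r"(?<!\S)(?P<number>123|456)(?!\S)")
one_lookahead = re.compile(r"\s(?P<number>123|456)(?!\S)")

def numbers(pattern, text):
    """Collect the 'number' group from every match."""
    return [m.group("number") for m in pattern.finditer(text)]

print(numbers(consuming, data))         # second number is missed
print(numbers(boundaries, data))        # both numbers
print(numbers(one_lookahead, data))     # both numbers
print(numbers(consuming, " 123  456 ")) # two spaces: both numbers
```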

About this regular expression (?<=\d)\d{4}

I use (?<=\d)\d{4} to match 1234567890, and the result is 2345 and 6789.
Why isn't it 2345 and 7890?
The second match starts from 6, and I thought the 6 would be the digit matched by (?<=\d), so I expected the result to be 7890 rather than 6789.
Besides, what happens when ((?<=\d)\d{3})+ is matched against 1234567890?
Lookbehinds are non-consuming, so the 5 is being "reused" in the second match (even though the first match consumed it).
If you want to start at 6, consume but don't capture:
\d(\d{4})
And use group 1, or, if your regex engine supports it, use a negative lookahead for \G, which matches at the end of the previous match:
(?!\G)(?<=\d)\d{4}
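The first alternative is easy to verify. The sketch below uses Python, whose built-in re module does not support \G, so only the consume-but-don't-capture version is shown next to the original pattern.

```python
import re

digits = "1234567890"

# The lookbehind only asserts a preceding digit, so after matching 2345
# the search resumes at 6, and the 5 satisfies the assertion.
with_lookbehind = re.findall(r"(?<=\d)\d{4}", digits)

# Consuming the leading digit moves the match start one position later.
consume_first = re.findall(r"\d(\d{4})", digits)

print(with_lookbehind)
print(consume_first)
```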
(?<=\d) is a zero-length assertion. Assertions do not consume characters in the string; they only assert whether a match is possible or not.
It matches this way because the first match finishes at 5, so the next group can be matched from 6. (?<=\d) checks the 5 in this case, and the match is 6789, starting with 6.
(?<=\d) doesn't belong to the match, it doesn't consume a character, it's just asserting what is in front of the match.
(?<=\d)\d{4}
?<= is a lookbehind. It makes sure a digit precedes the text to be matched.
What text are we matching? \d{4}. So the meaning is: match four digits that are preceded by a digit.
In 1234567890 the first such match is 2345, as it is preceded by 1. Matching then continues in the same string after that match, again looking for a group of four digits preceded by a digit. Since 2345 has already been matched, the next successful match is 6789, which is preceded by 5 and so satisfies the lookbehind.
(?<=\d)\d{3} does the same thing as before, only with a group of 3. To get the regex you mentioned, we wrap the whole thing in a capture group, ((?<=\d)\d{3}), and say one or more of this: ((?<=\d)\d{3})+. A repeated capturing group only captures the last iteration.
So the overall match is 234567890, and 890 is what group 1 returns.
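The behaviour of the repeated group can be checked with a short Python sketch (variable names are illustrative):

```python
import re

m = re.search(r"((?<=\d)\d{3})+", "1234567890")

whole_match = m.group(0)  # everything the repeated group covered
last_group = m.group(1)   # only the final iteration is kept

print(whole_match, last_group)
```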

regex: find one-digit number

I need to find all the one-digit numbers in a text.
My code:
$string = 'text 4 78 text 558 my.name#gmail.com 5 text 78998 text';
$pattern = '/ [\d]{1} /';
(result: 4 and 5)
Everything works perfectly; I just wanted to ask whether it is correct to use spaces.
Maybe there is some other way to pick out a one-digit number.
Thanks
First of all, [\d]{1} is equivalent to \d.
As for your question, it would be better to use a zero width assertion like a lookbehind/lookahead or word boundary (\b). Otherwise you will not match consecutive single digits because the leading space of the second digit will be matched as the trailing space of the first digit (and overlapping matches won't be found).
Here is how I would write this:
(?<!\S)\d(?!\S)
This means "match a digit only if there is not a non-whitespace character before it, and there is not a non-whitespace character after it".
I used the double negative like (?!\S) instead of (?=\s) so that you will also match single digits that are at the beginning or end of the string.
I prefer this over \b\d\b for your example because it looks like you really only want to match when the digit is surrounded by spaces, and \b\d\b would match the 4 and the 5 in a string like 192.168.4.5
To allow punctuation at the end, you could use the following:
(?<!\S)\d(?![^\s.,?!])
Add any additional punctuation characters that you want to allow after the digit to the character class (inside of the square brackets, but make sure it is after the ^).
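A quick way to compare these options is to run them on a few inputs. The question uses PHP, but the sketch below is in Python for brevity; the patterns themselves are unchanged, and the extra test strings are made up for illustration.

```python
import re

sample = "text 4 78 text 558 my.name#gmail.com 5 text 78998 text"

# Literal spaces are consumed, so neighbouring single digits share a
# space and every second one is skipped.
with_spaces = re.findall(r" \d ", "a 1 2 3 b")

# Whitespace "boundaries" built from lookarounds consume nothing.
bounded = re.compile(r"(?<!\S)\d(?!\S)")
# Same idea, but also tolerating . , ? ! right after the digit.
bounded_punct = re.compile(r"(?<!\S)\d(?![^\s.,?!])")

print(with_spaces)
print(bounded.findall("a 1 2 3 b"))
print(bounded.findall(sample))
print(bounded.findall("192.168.4.5"))
print(re.findall(r"\b\d\b", "192.168.4.5"))
print(bounded_punct.findall("Call agent 7, pronto! Take 5."))
```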
Use word boundaries. Note that the quantifier {1} is redundant (a single \d already matches exactly one digit), and so is the character class [], because it contains only one item.
\b\d\b
Search around word boundaries:
\b\d\b
As explained by the others, this extracts single digits bounded by any non-word character, so separators such as the "." in an IP address are not respected. To address that, see F.J's and Mike Brant's answers.
It really depends on where the numbers can appear and whether you care if they are adjacent to other characters (like . at the end of a sentence). At the very least, I would use word boundaries so that you can get numbers at the beginning and end of the input string:
$pattern = '/\b\d\b/';
But you might consider punctuation at the end like:
$pattern = '/\b\d(\b|\.|\?|\!)/';
If one-digit numbers can be preceded or followed by characters other than digits (e.g., "a1 cat" or "Call agent 7, pronto!") use
(?<!\d)\d(?!\d)
The regular expression reads: match a digit (\d) that is neither preceded nor followed by a digit. (?<!\d) is a negative lookbehind and (?!\d) is a negative lookahead.
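As an illustration, here is the digit-based lookaround in a short Python sketch, applied to the examples mentioned above plus an IP-style string (an added test input) to show that it is looser than the whitespace-based version.

```python
import re

lone_digit = re.compile(r"(?<!\d)\d(?!\d)")

# Letters and punctuation next to the digit are fine; only digits are not.
print(lone_digit.findall("a1 cat"))
print(lone_digit.findall("Call agent 7, pronto!"))
print(lone_digit.findall("text 4 78 text 558 5 text 78998 text"))
# Dots are not digits, so the 4 and 5 of an IP-style string still match.
print(lone_digit.findall("192.168.4.5"))
```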

How do I write a regex that won't match a certain amount of whitespace?

I'm trying to write a regex that won't match a certain number of white spaces, but it's not going the way I expected.
I have these strings:
123      99999   # has 6 white spaces
321      99999   # same
123   8888       # has 3 white spaces  \
321   8888       # same                 | - These are the lines I
1237777                                 |   want to match
3217777                                /
I want to match the last four lines, i.e. starts with 123 or 321 followed by anything but 6 whitespace characters:
^(123|321)[^\ ]{6}.*
This doesn't seem to do the trick - this matches only the last two. What am I missing?
"   888"
Line these six characters up against [^\ ]{6} and they do not match, because the pattern is saying
[not a space][not a space][not a space][not a space][not a space][not a space]
In this case the first 3 characters are spaces, so the pattern fails.
You can use a negative lookahead: ^(123|321)(?!\s{6}). What I prefer, because it is more readable, is to write the regular expression to match what you don't want and then negate the result (i.e., not, !, etc.). I don't know enough about your data, but I would match \s{6} and then negate it.
Try this:
^(123|321)(?!\s{6}).*
(uses a negative lookahead to check whether 6 whitespace characters follow the prefix before .* matches the rest)
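A minimal Python sketch of the negative-lookahead approach follows. The sample lines use the spacing described in the question's comments (six and three spaces), and the names are illustrative. It also shows why the alternation should be grouped: in ^(123)|(321)(?!\s{6}) the ^ binds only to 123 and the lookahead only to 321.

```python
import re

lines = [
    "123      99999",  # 6 spaces: should be rejected
    "321      99999",  # 6 spaces: should be rejected
    "123   8888",      # 3 spaces
    "321   8888",      # 3 spaces
    "1237777",
    "3217777",
]

not_six_spaces = re.compile(r"^(123|321)(?!\s{6}).*")
matched = [line for line in lines if not_six_spaces.match(line)]
print(matched)

# Ungrouped alternation: the anchor applies to the first branch only,
# and the lookahead applies to the second branch only.
loose = re.compile(r"^(123)|(321)(?!\s{6})")
print(loose.search("123      99999"))  # matches "123" despite 6 spaces
print(loose.search("xx321 rest"))      # matches "321" away from the start
```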
What language are you doing this in? If in Perl or something that supports PCREs, you can simply use a negative lookahead assertion:
^(123|321)(?!\ {6}).*
You first need to allow up to 3 whitespace characters and then forbid any further whitespace, like this:
^([0-9]+)(\s{0,3})([^ ]{3})([0-9]*)$
^([0-9]+) = Accepts one or more digits at the beginning of your string.
(\s{0,3}) = Accepts zero to three whitespace characters.
([^ ]{3}) = Requires three non-space characters after the allowed spaces, which rules out three more spaces.
([0-9]*) = Accepts any further digits up to the end of your string.
Or:
^([0-9]+)(\s{0,3})(?!\s+)([0-9]*)$
The only change here is that after the three allowed spaces it won't accept any more whitespace (I prefer this second option because it's more readable).
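For comparison, here is a short Python sketch that runs both of these patterns on lines with zero, three, four, and six spaces (the sample lines are assumptions based on the question). One thing it makes visible: this approach rejects any run of more than three spaces, not only runs of exactly six, so it is stricter than "anything but 6 whitespace characters".

```python
import re

three_nonspace = re.compile(r"^([0-9]+)(\s{0,3})([^ ]{3})([0-9]*)$")
no_more_space = re.compile(r"^([0-9]+)(\s{0,3})(?!\s+)([0-9]*)$")

samples = [
    "1237777",         # no spaces
    "123   8888",      # 3 spaces
    "123    9999",     # 4 spaces
    "123      99999",  # 6 spaces
]

# Map each sample line to whether the pattern accepts it.
results_a = {s: bool(three_nonspace.match(s)) for s in samples}
results_b = {s: bool(no_more_space.match(s)) for s in samples}

for s in samples:
    print(repr(s), results_a[s], results_b[s])
```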
Hope it helps.